Apparatus, System, and Method for Data Block Usage Information Synchronization for a Non-Volatile Storage Volume

ABSTRACT

An apparatus, system, and method are disclosed for data block usage information synchronization for a non-volatile storage volume. The method includes referencing first data block usage information for data blocks of a non-volatile storage volume managed by a storage manager. The first data block usage information is maintained by the storage manager. The method also includes synchronizing second data block usage information managed by a storage controller with the first data block usage information maintained by the storage manager. The storage manager maintains the first data block usage information separate from second data block usage information managed by the storage controller.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 14/569,021 filed Dec. 12, 2014, which is a continuation of U.S. application Ser. No. 12/711,113 filed Feb. 23, 2010, now U.S. Pat. No. 8,935,302, which is a continuation-in-part of U.S. application Ser. No. 11/952,109, now U.S. Pat. No. 8,296,337, which is a continuation-in-part of U.S. application Ser. No. 11/952,113 filed Dec. 6, 2007, now U.S. Pat. No. 8,261,005, which claim the benefit of U.S. Provisional Application No. 60/974,470 filed Sep. 22, 2007, and U.S. Provisional Application No. 60/873,111 filed Dec. 6, 2006, each of which is incorporated herein by reference.

FIELD OF THE INVENTION

This invention relates to block storage on a non-volatile storage volume and more particularly relates to data block usage information synchronization for a non-volatile storage volume.

BACKGROUND

Description of the Related Art

Many conventional storage devices treat storage block addresses received from a storage client as logical block addresses having a one-to-one direct mapping to corresponding physical addresses on the storage media where data is actually stored. For storage devices that maintain a mapping from a logical block address to an arbitrary physical address, conventional storage clients (operating systems, file systems, volume managers, and the like) have begun to communicate when data on physical media corresponding to a logical block address no longer needs to be retained. This unused data block usage information enables deallocation of the corresponding physical blocks and/or allows the storage device to stop preserving the data in the corresponding physical blocks. As a result, data on the storage device corresponding to logical blocks that are not in use by a storage client is no longer unnecessarily preserved by the storage device. Without this capability, unused data blocks must be preserved by the storage device as used data blocks, which slows performance and requires additional, unnecessary overhead to maintain.

However, certain storage clients are not designed to communicate unused data block usage information. Additionally, certain storage clients that have the ability to communicate unused data block usage information do so ineffectively or lack the ability to communicate unused data block usage information for certain storage configurations. In addition, in certain storage configurations, even though the unused block usage information is communicated, the information is not passed on to the storage device.

SUMMARY

The present invention has been developed in response to the present state of the art, and in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available storage systems. Accordingly, the present invention has been developed to provide an apparatus, system, and method for data block usage information synchronization for a non-volatile storage volume that overcome many or all of the above-discussed shortcomings in the art.

The method for data block usage information synchronization for a non-volatile storage volume includes referencing first data block usage information for data blocks of a non-volatile storage volume managed by a storage manager. The first data block usage information is maintained by the storage manager. The method also includes synchronizing second data block usage information managed by a storage controller with the first data block usage information maintained by the storage manager. The storage manager maintains the first data block usage information separate from second data block usage information managed by the storage controller.

In one embodiment, the method includes determining one or more unused blocks from the first data block usage information and sending a message directly to the storage controller directly managing the non-volatile storage volume. The message indicates to the storage controller unused blocks identified by the storage manager. The storage controller deallocates the unused blocks identified by the storage manager in response to the message. In one embodiment, synchronizing second data block usage information further includes deallocating blocks identified by the storage controller as in use (or used) corresponding to one or more unused blocks identified by the storage manager based on the first data block usage information.
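
The flow described above can be sketched in C. This is a minimal illustration only; the bitmap layout, the type names, and the deallocate_blocks() message primitive are assumptions introduced for the sketch and are not defined by the specification.

```c
#include <stdint.h>

/* Hypothetical bitmap of first data block usage information: bit set = block in use. */
typedef struct {
    uint64_t *bits;
    uint64_t  nblocks;
} block_map_t;

static int block_in_use(const block_map_t *map, uint64_t lba)
{
    return (map->bits[lba / 64] >> (lba % 64)) & 1u;
}

/* Placeholder for the message sent directly to the storage controller. */
extern void deallocate_blocks(uint64_t first_lba, uint64_t count);

/*
 * Walk the storage manager's block map and tell the storage controller to
 * deallocate every run of blocks the manager considers unused, bringing the
 * controller's second data block usage information into agreement.
 */
void synchronize_block_usage(const block_map_t *manager_map)
{
    uint64_t lba = 0;

    while (lba < manager_map->nblocks) {
        if (block_in_use(manager_map, lba)) {
            lba++;
            continue;
        }
        /* Coalesce a run of unused blocks into a single message. */
        uint64_t start = lba;
        while (lba < manager_map->nblocks && !block_in_use(manager_map, lba))
            lba++;
        deallocate_blocks(start, lba - start);
    }
}
```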

In certain embodiments, synchronizing second data block usage information further includes determining that the storage controller identifies one or more unused blocks indicated by the first data block usage information as used blocks and deallocating the used blocks identified by the storage controller corresponding to the one or more unused blocks. In one embodiment, the method includes updating the first data block usage information based on storage operations that modify the first data block usage information. The storage operations are executed by the storage controller subsequent to referencing the first data block usage information and executed by the storage controller prior to synchronizing the second data block usage information.

In one embodiment, referencing the first data block usage information further includes referencing the first data block usage information by way of a storage Application Programming Interface (“API”) of the storage manager. In one embodiment, the non-volatile storage volume includes a live volume actively servicing storage requests.

In one embodiment, the storage controller includes a redundant array of independent drives (“RAID”) controller storing data in a RAID configuration on two or more storage devices. In this embodiment, synchronizing second data block usage information synchronizes second data block usage information managed for the two or more storage devices with the first data block usage information. The two or more storage devices are managed by the RAID controller. In one embodiment, the method further includes determining a RAID configuration of the RAID controller and synchronizing second data block usage information managed for the two or more storage devices with the first data block usage information based on the determined RAID configuration.

In one embodiment, the RAID controller manages one or more sub-controllers, each sub-controller storing data on one or more of the two or more storage devices. In certain embodiments, the RAID configuration includes a RAID 0 configuration that stores data as a stripe across the two or more storage devices. In this embodiment, synchronizing second data block usage information includes identifying a first portion of the first data block usage information corresponding to data blocks stored on a first storage device. The method also includes identifying a second portion of the first data block usage information corresponding to data blocks stored on a second storage device. The method also includes synchronizing second data block usage information managed for the first storage device with the first portion of the first data block usage information. The method also includes synchronizing second data block usage information managed for the second storage device with the second portion of the first data block usage information.
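
One way to partition the block usage information among RAID 0 members is shown in the following hedged C sketch; the stride size, device count, and deallocate_on_device() helper are illustrative assumptions rather than elements of the specification.

```c
#include <stdint.h>

/* Hypothetical per-device deallocation message. */
extern void deallocate_on_device(int device, uint64_t device_lba, uint64_t count);

/*
 * For RAID 0, each logical block address maps to exactly one member device.
 * Blocks the storage manager reports as unused are routed to the portion of
 * the second data block usage information kept for that device.
 */
void sync_raid0_unused(const uint64_t *unused_lbas, uint64_t n,
                       uint64_t blocks_per_stride, int ndevices)
{
    for (uint64_t i = 0; i < n; i++) {
        uint64_t lba     = unused_lbas[i];
        uint64_t stride  = lba / blocks_per_stride;       /* which stride        */
        int      device  = (int)(stride % ndevices);      /* which member device */
        uint64_t dev_lba = (stride / ndevices) * blocks_per_stride
                         + (lba % blocks_per_stride);     /* address on member   */
        deallocate_on_device(device, dev_lba, 1);
    }
}
```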

In one embodiment, the RAID configuration includes a RAID 1 configuration that mirrors data stored on a first storage device to a second storage device. In this embodiment, synchronizing second data block usage information includes synchronizing second data block usage information managed for the first storage device with the first data block usage information. The method also includes synchronizing the second data block usage information managed for the second storage device with the first data block usage information.

In one embodiment, the RAID configuration includes a RAID 5 configuration that stores data as a stripe across three or more storage devices. The stripe includes two or more data strides and a distributed parity data stride. Each data stride is stored on a storage device and each data stride includes one or more data blocks. In this embodiment, synchronizing second data block usage information includes determining that each data stride in the stripe comprises no used blocks based on the first data block usage information. The method also includes synchronizing second data block usage information managed by the RAID controller for the stripe by designating data blocks of the second data block usage information corresponding to the stripe as unused.
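
A hedged sketch of the whole-stripe test described above follows; the stripe geometry parameters and the block_used() and mark_stripe_unused() helpers are assumptions introduced only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical query against the storage manager's block map. */
extern bool block_used(uint64_t lba);
/* Hypothetical message marking a whole stripe unused at the RAID controller. */
extern void mark_stripe_unused(uint64_t stripe);

/*
 * In RAID 5, parity couples the data strides of a stripe, so this sketch only
 * releases a stripe when every data block in every data stride of that stripe
 * is unused according to the first data block usage information.
 */
void sync_raid5_stripe(uint64_t stripe, uint64_t data_strides_per_stripe,
                       uint64_t blocks_per_stride)
{
    uint64_t first_lba = stripe * data_strides_per_stripe * blocks_per_stride;
    uint64_t nblocks   = data_strides_per_stripe * blocks_per_stride;

    for (uint64_t i = 0; i < nblocks; i++)
        if (block_used(first_lba + i))
            return;                  /* at least one block is still in use */

    mark_stripe_unused(stripe);      /* all data strides are unused */
}
```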

In one embodiment, the RAID configuration includes a RAID 10 configuration that mirrors a stride of data between two or more storage devices using a RAID 1 configuration and that stores stripes of data across two or more storage device sets using a RAID 0 configuration. In this embodiment, synchronizing second data block usage information includes identifying a first portion of the first data block usage information corresponding to data blocks stored in a first stride managed by the RAID controller. The method also includes identifying a second portion of the first data block usage information corresponding to data blocks stored in a second stride managed by the RAID controller. The method also includes synchronizing second data block usage information managed for the first stride with the first portion of the first data block usage information. The method also includes synchronizing second data block usage information managed for the second stride with the second portion of the first data block usage information.

A computer program product is also provided for data management on non-volatile storage media managed by a storage manager. The computer program product includes referencing a block map defining data block usage information for data blocks of non-volatile storage media managed by a storage manager. The block map is maintained by the storage manager. The computer program product also includes sending a message directly to a storage controller managing the non-volatile storage media. The message identifies to the storage controller one or more unused blocks identified by the block map. In one embodiment, the computer program product includes determining one or more unused blocks from the block map.

In one embodiment, the computer program product includes deallocating used blocks identified by the storage controller corresponding to the one or more unused blocks identified by the storage manager in response to the message. In certain embodiments, the computer program product includes determining that the storage controller identifies the one or more unused blocks as used blocks and the computer program product deallocates the used blocks identified by the storage controller corresponding to the one or more unused blocks in response to the message.

In one embodiment, determining one or more unused blocks from the block map includes monitoring storage operations on data blocks represented by the block map. The storage operations are executed by the storage controller subsequent to referencing the block map and executed by the storage controller prior to deallocating the one or more unused blocks in response to the message. In this embodiment, determining one or more unused blocks from the block map includes recording data block usage information for the storage operations that change unused blocks of the block map to used blocks. In a further embodiment, recording data block usage information for the storage operations further includes recording the data block usage information in an in-flight block map. In this embodiment, the computer program product includes combining the block map and the in-flight block map to identify the one or more unused blocks of the data blocks.
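
A minimal sketch of the combining step, assuming 64-bit word bitmaps: a block is reported as unused only if it is unused in the referenced block map and was not written by an in-flight storage operation recorded after the map was referenced. All names here are hypothetical.

```c
#include <stdint.h>

/*
 * Combine the referenced block map with the in-flight block map so that
 * blocks written while the map was being examined are never reported to the
 * storage controller as unused.
 */
void combine_maps(const uint64_t *block_map,      /* bit set = block in use          */
                  const uint64_t *inflight_map,   /* bit set = written since referenced */
                  uint64_t       *unused_out,     /* bit set = safe to deallocate    */
                  uint64_t        nwords)
{
    for (uint64_t w = 0; w < nwords; w++)
        unused_out[w] = ~(block_map[w] | inflight_map[w]);
}
```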

In one embodiment, the computer program product includes obtaining a lock on a logical-to-physical map managed by the storage controller prior to determining one or more unused blocks from the block map and releasing the lock on the logical-to-physical map subsequent to the storage controller deallocating the unused blocks. The storage controller stores data on the non-volatile storage media using an append-only writing process and recovers storage space on the non-volatile storage media using a storage space recovery process that re-uses non-volatile storage media storing blocks that have become unused blocks.
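
The locking order described above might look like the following pthreads sketch; the lock object and the determine/deallocate call are placeholders, not part of the specification.

```c
#include <pthread.h>

extern pthread_mutex_t l2p_lock;   /* guards the controller's logical-to-physical map */
extern void determine_and_deallocate_unused(void);

/*
 * Hold the logical-to-physical map lock across the whole determine/deallocate
 * sequence so the storage space recovery process cannot re-use blocks while
 * the block map is being consulted.
 */
void locked_synchronize(void)
{
    pthread_mutex_lock(&l2p_lock);
    determine_and_deallocate_unused();
    pthread_mutex_unlock(&l2p_lock);
}
```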

In one embodiment, the message complies with an interface operable to communicate storage information between the storage manager and the storage controller. The message includes a notification passing the unused blocks identified by the storage manager to the storage controller. The notification includes no requirement for action by the storage controller in accordance with the interface. In certain embodiments, the message complies with an interface operable to communicate storage information between the storage manager and the storage controller and includes a directive passing the unused blocks identified by the storage manager to the storage controller. The directive requires the storage controller to erase the non-volatile storage media comprising the unused blocks in accordance with the interface.

In one embodiment, the message complies with an interface operable to communicate storage information between the storage manager and the storage controller and the message includes a purge instruction passing the unused blocks identified by the storage manager to the storage controller. The purge instruction requires the storage controller to erase the non-volatile storage media comprising the unused blocks and to overwrite the unused blocks in accordance with the interface.
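
The three message flavors discussed above (notification, directive, and purge instruction) could be modeled as a small C interface; the enum, struct, and field names are illustrative assumptions only.

```c
#include <stdint.h>

/* Hypothetical message kinds corresponding to the three behaviors above. */
enum unused_block_msg_kind {
    MSG_NOTIFY,     /* advisory: no action required of the controller        */
    MSG_DIRECTIVE,  /* controller must erase the media holding the blocks    */
    MSG_PURGE       /* controller must erase and then overwrite the blocks   */
};

struct unused_block_msg {
    enum unused_block_msg_kind kind;
    uint64_t first_lba;   /* first unused block identified by the storage manager */
    uint64_t count;       /* number of consecutive unused blocks                  */
};
```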

In one embodiment, the storage controller includes a redundant array of independent drives (“RAID”) controller storing data in a RAID configuration on two or more storage devices managed by the RAID controller. In certain embodiments, sending a message to the storage controller includes sending one or more messages communicating the unused blocks identified by the storage manager to one or more sub-controllers.

In one embodiment, the RAID configuration includes a RAID 0 configuration that stores data as a stripe across the two or more storage devices. In this embodiment, the computer program product further includes identifying a first portion of the block map corresponding to data blocks stored on a first storage device managed by the RAID controller. The computer program product also includes identifying a second portion of the block map corresponding to data blocks stored on a second storage device managed by the RAID controller. Sending a message to the storage controller includes sending a first message to the RAID controller. The first message identifies one or more unused blocks on the first storage device identified by the first portion of the block map. Sending a message also includes sending a second message to the RAID controller. The second message identifies one or more unused blocks on the second storage device identified by the second portion of the block map.

In one embodiment, the RAID configuration includes a RAID 1 configuration that mirrors data stored on a first storage device to a second storage device. Sending a message to the storage controller includes sending a first message to the RAID controller managing the first storage device. The first message identifies one or more unused blocks on the first storage device identified by the block map. Sending a message to the storage controller also includes sending a second message to the RAID controller managing the second storage device. The second message identifies one or more unused blocks on the second storage device identified by the block map.

In one embodiment, the RAID configuration includes a RAID 5 configuration that stores data as a stripe across three or more storage devices. The stripe includes two or more data strides and a distributed parity data stride. Each data stride is stored on a storage device and each data stride includes one or more data blocks. In this embodiment, the computer program product further includes determining that each data stride in the stripe comprises no used blocks based on the block map. Sending a message to the storage controller includes sending a message to the RAID controller. The message designates data blocks corresponding to the stripe as unused.

In one embodiment, the RAID configuration includes a RAID 10 configuration that mirrors a stride of data between two or more storage devices using a RAID 1 configuration and that stores stripes of data across two or more storage device sets using a RAID 0 configuration. In this embodiment, the computer program product further includes identifying a first portion of the block map corresponding to data blocks stored in a first stride managed by the RAID controller. The computer program product may also include identifying a second portion of the block map corresponding to data blocks stored in a second stride managed by the RAID controller. Sending a message to the storage controller includes sending a first message to the RAID controller managing the first stride. The message identifies one or more unused blocks in the first stride identified by the first portion of the block map. Sending a message to the storage controller also includes sending a second message to the RAID controller managing the second stride. The second message identifies one or more unused blocks in the second stride identified by the second portion of the block map.

A system and method are also presented for data block usage information synchronization for a non-volatile storage volume managed by a storage manager that includes the necessary components and steps to execute the functions described above in relation to the computer program product. In addition, the system includes a block usage synchronizer that, in one embodiment, is initiated in response to one or more predetermined events. In certain embodiments, the block usage synchronizer is initiated at a predetermined time interval.

Furthermore, the method includes calling a function of a storage Application Programming Interface (“API”) to reference a block map defining data block usage information for a set of data blocks of a non-volatile storage volume, such as a flash storage volume. The block map is maintained by a storage manager. The non-volatile storage volume is exclusively managed by a storage controller configured to use a logical-to-physical address translation layer that translates logical block addresses received from a storage client to physical block addresses on the non-volatile storage volume. In one embodiment, the non-volatile storage volume includes one or more non-volatile storage media. In certain embodiments, the storage API includes a defragmentation API for block-oriented storage devices.

Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but does not necessarily, refer to the same embodiment.

Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.

These features and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:

FIG. 1 is a schematic block diagram illustrating one embodiment of a system for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 2 is a schematic block diagram illustrating one embodiment of a solid-state storage device controller in a solid-state storage device in accordance with the present invention;

FIG. 3 is a schematic block diagram illustrating one embodiment of a solid-state storage controller with a write data pipeline and a read data pipeline in a solid-state storage device in accordance with the present invention;

FIG. 4 is a schematic block diagram illustrating one embodiment of a bank interleave controller in the solid-state storage controller in accordance with the present invention;

FIG. 5 is a schematic block diagram illustrating a logical representation of a solid-state storage controller in accordance with the present invention;

FIG. 6 is a schematic block diagram illustrating one embodiment of a system for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 7 is a schematic block diagram illustrating one embodiment of a system for data block usage information synchronization for a non-volatile storage volume using a RAID controller in accordance with the present invention;

FIG. 8 is a schematic block diagram illustrating another embodiment of a system for data block usage information synchronization for a non-volatile storage volume using a RAID controller in accordance with the present invention;

FIG. 9 is a schematic block diagram illustrating one embodiment of an apparatus for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 10 is a detailed schematic block diagram illustrating another embodiment of an apparatus for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 11 is a schematic block diagram illustrating an embodiment of an apparatus for data management on non-volatile storage media managed by a storage manager in accordance with the present invention;

FIG. 12 is a detailed schematic block diagram illustrating another embodiment of an apparatus for data management on non-volatile storage media managed by a storage manager in accordance with the present invention;

FIG. 13A is a schematic flow chart diagram illustrating one embodiment of a method for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 13B is a detailed schematic flow chart diagram illustrating another embodiment of a method for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention;

FIG. 14 is a schematic flow chart diagram illustrating an embodiment of a method for data management on non-volatile storage media managed by a storage manager in accordance with the present invention; and

FIG. 15 is a detailed schematic flow chart diagram illustrating another embodiment of a method for data management on non-volatile storage media managed by a storage manager in accordance with the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Many of the functional units described in this specification have been labeled as modules, in order to more particularly emphasize their implementation independence. For example, a module may be implemented as a hardware circuit comprising custom VLSI circuits or gate arrays, off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.

Modules may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose for the module.

Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network. Where a module or portions of a module are implemented in software, the software portions are stored on one or more computer readable media.

Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

Reference to a computer readable medium may take any form capable of storing machine-readable instructions on a digital processing apparatus memory device. A computer readable medium may be embodied by a compact disk, digital-video disk, a magnetic tape, a Bernoulli drive, a magnetic disk, a punch card, flash memory (NAND or NOR), other types of solid-state memory, integrated circuits, or other digital processing apparatus memory device.

Furthermore, the described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.

The schematic flow chart diagrams included herein are generally set forth as logical flow chart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.

Solid-State Storage System

FIG. 1 is a schematic block diagram illustrating one embodiment of a system 100 for improving performance in a solid-state storage device in accordance with the present invention. The system 100 includes a solid-state storage device 102, a solid-state storage controller 104, a write data pipeline 106, a read data pipeline 108, a solid-state storage media 110, a computer 112, a client 114, and a computer network 116, which are described below.

The system 100 includes at least one solid-state storage device 102. In another embodiment, the system 100 includes two or more solid-state storage devices 102. Each solid-state storage device 102 may include non-volatile, solid-state storage media 110, such as flash memory, nano random access memory (“nano RAM or NRAM”), magneto-resistive RAM (“MRAM”), dynamic RAM (“DRAM”), phase change RAM (“PRAM”), etc. The solid-state storage device 102 is described in more detail with respect to FIGS. 2 and 3. The solid-state storage device 102 is depicted in a computer 112 connected to a client 114 through a computer network 116. In one embodiment, the solid-state storage device 102 is internal to the computer 112 and is connected using a system bus, such as a peripheral component interconnect express (“PCI-e”) bus, a Serial Advanced Technology Attachment (“serial ATA”) bus, or the like. In another embodiment, the solid-state storage device 102 is external to the computer 112 and is connected using a universal serial bus (“USB”) connection, an Institute of Electrical and Electronics Engineers (“IEEE”) 1394 bus (“FireWire”), or the like. In other embodiments, the solid-state storage device 102 is connected to the computer 112 using a peripheral component interconnect (“PCI”) express bus using an external electrical or optical bus extension or bus networking solution such as InfiniBand or PCI Express Advanced Switching (“PCIe-AS”), or the like.

In various embodiments, the solid-state storage device 102 may be in the form of a dual-inline memory module (“DIMM”), a daughter card, or a micro-module. In another embodiment, the solid-state storage device 102 is an element within a rack-mounted blade. In another embodiment, the solid-state storage device 102 is contained within a package that is integrated directly onto a higher level assembly (e.g. mother board, laptop, graphics processor). In another embodiment, individual components comprising the solid-state storage device 102 are integrated directly onto a higher level assembly without intermediate packaging.

The solid-state storage device 102 includes one or more solid-state storage controllers 104, each of which may include a write data pipeline 106 and a read data pipeline 108, and each of which includes a solid-state storage media 110, which are described in more detail below with respect to FIGS. 2 and 3.

The system 100 includes one or more computers 112 connected to the solid-state storage device 102. A computer 112 may be a host, a server, a storage controller of a storage area network (“SAN”), a workstation, a personal computer, a laptop computer, a handheld computer, a supercomputer, a computer cluster, a network switch, router, or appliance, a database or storage appliance, a data acquisition or data capture system, a diagnostic system, a test system, a robot, a portable electronic device, a wireless device, or the like. In another embodiment, a computer 112 may be a client and the solid-state storage device 102 operates autonomously to service data requests sent from the computer 112. In this embodiment, the computer 112 and solid-state storage device 102 may be connected using a computer network, system bus, or other communication means suitable for connection between a computer 112 and an autonomous solid-state storage device 102.

In one embodiment, the system 100 includes one or more clients 114 connected to one or more computers 112 through one or more computer networks 116. A client 114 may be a host, a server, a storage controller of a SAN, a workstation, a personal computer, a laptop computer, a handheld computer, a supercomputer, a computer cluster, a network switch, router, or appliance, a database or storage appliance, a data acquisition or data capture system, a diagnostic system, a test system, a robot, a portable electronic device, a wireless device, or the like. The computer network 116 may include the Internet, a wide area network (“WAN”), a metropolitan area network (“MAN”), a local area network (“LAN”), a token ring, a wireless network, a fiber channel network, a SAN, network attached storage (“NAS”), ESCON, or the like, or any combination of networks. The computer network 116 may also include a network from the IEEE 802 family of network technologies, such as Ethernet, token ring, WiFi, WiMax, and the like.

The computer network 116 may include servers, switches, routers, cabling, radios, and other equipment used to facilitate networking of computers 112 and clients 114. In one embodiment, the system 100 includes multiple computers 112 that communicate as peers over a computer network 116. In another embodiment, the system 100 includes multiple solid-state storage devices 102 that communicate as peers over a computer network 116. One of skill in the art will recognize other computer networks 116 comprising one or more computer networks 116 and related equipment with single or redundant connections between one or more clients 114 or other computers and one or more solid-state storage devices 102, or one or more solid-state storage devices 102 connected to one or more computers 112. In one embodiment, the system 100 includes two or more solid-state storage devices 102 connected through the computer network 116 to a client 114 without a computer 112.

Solid-State Storage Device

FIG. 2 is a schematic block diagram illustrating one embodiment 201 of a solid-state storage device controller 202 that includes a write data pipeline 106 and a read data pipeline 108 in a solid-state storage device 102 in accordance with the present invention. The solid-state storage device controller 202 may be embodied as hardware, as software, or as a combination of hardware and software. The solid-state storage device controller 202 may include a number of solid-state storage controllers 0-N 104 a-n, each controlling solid-state storage media 110. In the depicted embodiment, two solid-state controllers are shown: solid-state controller 0 104 a and solid-state storage controller N 104 n, and each controls solid-state storage media 110 a-n. In the depicted embodiment, solid-state storage controller 0 104 a controls a data channel so that the attached solid-state storage media 110 a stores data. Solid-state storage controller N 104 n controls an index metadata channel associated with the stored data and the associated solid-state storage media 110 n stores index metadata. In an alternate embodiment, the solid-state storage device controller 202 includes a single solid-state controller 104 a with a single solid-state storage media 110 a. In another embodiment, there are a plurality of solid-state storage controllers 104 a-n and associated solid-state storage media 110 a-n. In one embodiment, one or more solid-state controllers 104 a-104 n-1, coupled to their associated solid-state storage media 110 a-110 n-1, control data while at least one solid-state storage controller 104 n, coupled to its associated solid-state storage media 110 n, controls index metadata.

In one embodiment, at least one solid-state controller 104 is a field-programmable gate array (“FPGA”) and controller functions are programmed into the FPGA. In a particular embodiment, the FPGA is a Xilinx® FPGA. In another embodiment, the solid-state storage controller 104 comprises components specifically designed as a solid-state storage controller 104, such as an application-specific integrated circuit (“ASIC”) or custom logic solution. Each solid-state storage controller 104 typically includes a write data pipeline 106 and a read data pipeline 108, which are described further in relation to FIG. 3. In another embodiment, at least one solid-state storage controller 104 is made up of a combination of FPGA, ASIC, and custom logic components.

Solid-State Storage

The solid-state storage media 110 is an array of non-volatile solid-state storage elements 216, 218, 220, arranged in banks 214, and accessed in parallel through a bi-directional storage input/output (“I/O”) bus 210. The storage I/O bus 210, in one embodiment, is capable of unidirectional communication at any one time. For example, when data is being written to the solid-state storage media 110, data cannot be read from the solid-state storage media 110. In another embodiment, data can flow in both directions simultaneously. However, bi-directional, as used herein with respect to a data bus, refers to a data pathway that can have data flowing in only one direction at a time, but when data flowing in one direction on the bi-directional data bus is stopped, data can flow in the opposite direction on the bi-directional data bus.

A solid-state storage element (e.g. SSS 0.0 216 a) is typically configured as a chip (a package of one or more dies) or a die on a circuit board. As depicted, a solid-state storage element (e.g. 216 a) operates independently or semi-independently of other solid-state storage elements (e.g. 218 a) even if these several elements are packaged together in a chip package, a stack of chip packages, or some other package element. As depicted, a column of solid-state storage elements 216, 218, 220 is designated as a bank 214. As depicted, there may be “n” banks 214 a-n and “m” solid-state storage elements 216 a-m, 218 a-m, 220 a-m per bank in an array of n×m solid-state storage elements 216, 218, 220 in a solid-state storage media 110. In one embodiment, a solid-state storage media 110 a includes twenty solid-state storage elements per bank (e.g. 216 a-m in bank 214 a, 218 a-m in bank 214 b, 220 a-m in bank 214 n, where m=22) with eight banks (e.g. 214 a-n where n=8) and a solid-state storage media 110 n includes two solid-state storage elements (e.g. 216 a-m where m=2) per bank 214 with one bank 214 a. There is no requirement that two solid-state storage media 110 a, 110 n have the same number of solid-state storage elements and/or the same number of banks 214. In one embodiment, each solid-state storage element 216, 218, 220 is comprised of single-level cell (“SLC”) devices. In another embodiment, each solid-state storage element 216, 218, 220 is comprised of multi-level cell (“MLC”) devices.

In one embodiment, solid-state storage elements for multiple banks that share a common storage I/O bus 210 a row (e.g. 216 b, 218 b, 220 b) are packaged together. In one embodiment, a solid-state storage element 216, 218, 220 may have one or more dies per chip with one or more chips stacked vertically and each die may be accessed independently. In another embodiment, a solid-state storage element (e.g. SSS 0.0 216 a) may have one or more virtual dies per die and one or more dies per chip and one or more chips stacked vertically and each virtual die may be accessed independently. In another embodiment, a solid-state storage element SSS 0.0 216 a may have one or more virtual dies per die and one or more dies per chip with some or all of the one or more dies stacked vertically and each virtual die may be accessed independently.

In one embodiment, two dies are stacked vertically with four stacks per group to form eight storage elements (e.g. SSS 0.0-SSS 0.8) 216 a-220 a, each in a separate bank 214 a-n. In another embodiment, 20 storage elements (e.g. SSS 0.0-SSS 20.0) 216 form a virtual bank 214 a so that each of the eight virtual banks has 20 storage elements (e.g. SSS 0.0-SSS 20.8). Data is sent to the solid-state storage media 110 over the storage I/O bus 210 to all storage elements of a particular group of storage elements (SSS 0.0-SSS 0.8) 216 a, 218 a, 220 a. The storage control bus 212 a is used to select a particular bank (e.g. Bank-0 214 a) so that the data received over the storage I/O bus 210 connected to all banks 214 is written just to the selected bank 214 a.

In certain embodiments, the storage control bus 212 and storage I/O bus 210 are used together by the solid-state controller 104 to communicate addressing information, storage element command information, and data to be stored. Those of skill in the art recognize that this address, data, and command information may be communicated using one or the other of these buses 212, 210, or using separate buses for each type of control information. In one embodiment, addressing information, storage element command information, and storage data travel on the storage I/O bus 210 and the storage control bus 212 carries signals for activating a bank as well as identifying whether the data on the storage I/O bus 210 lines constitute addressing information, storage element command information, or storage data.

For example, a control signal on the storage control bus 212 such as “command enable” may indicate that the data on the storage I/O bus 210 lines is a storage element command such as program, erase, reset, read, and the like. A control signal on the storage control bus 212 such as “address enable” may indicate that the data on the storage I/O bus 210 lines is addressing information such as erase block identifier, page identifier, and optionally offset within the page within a particular storage element. Finally, an absence of a control signal on the storage control bus 212 for both “command enable” and “address enable” may indicate that the data on the storage I/O bus 210 lines is storage data that is to be stored on the storage element at a previously addressed erase block, physical page, and optionally offset within the page of a particular storage element.

In one embodiment, the storage I/O bus 210 is comprised of one or more independent I/O buses (“IIOBa-m” comprising 210 a.a-m, 210 n.a-m) wherein the solid-state storage elements within each row share one of the independent I/O buses across each solid-state storage element 216, 218, 220 in parallel so that all banks 214 are accessed simultaneously. For example, one IIOB 210 a.a of the storage I/O bus 210 may access a first solid-state storage element 216 a, 218 a, 220 a of each bank 214 a-n simultaneously. A second IIOB 210 a.b of the storage I/O bus 210 may access a second solid-state storage element 216 b, 218 b, 220 b of each bank 214 a-n simultaneously. Each row of solid-state storage elements 216, 218, 220 is accessed simultaneously. In one embodiment, where solid-state storage elements 216, 218, 220 are multi-level (physically stacked), all physical levels of the solid-state storage elements 216, 218, 220 are accessed simultaneously. As used herein, “simultaneously” also includes near simultaneous access where devices are accessed at slightly different intervals to avoid switching noise. Simultaneously is used in this context to be distinguished from a sequential or serial access wherein commands and/or data are sent individually one after the other.

Typically, banks 214 a-n are independently selected using the storage control bus 212. In one embodiment, a bank 214 is selected using a chip enable or chip select. Where both chip select and chip enable are available, the storage control bus 212 may select one level of a multi-level solid-state storage element 216, 218, 220 using either of the chip select signal and the chip enable signal. In other embodiments, other commands are used by the storage control bus 212 to individually select one level of a multi-level solid-state storage element 216, 218, 220. Solid-state storage elements 216, 218, 220 may also be selected through a combination of control information and address information transmitted on the storage I/O bus 210 and the storage control bus 212.

In one embodiment, each solid-state storage element 216, 218, 220 is partitioned into erase blocks and each erase block is partitioned into pages. A typical page is 2000 bytes (“2 kB”). In one example, a solid-state storage element (e.g. SSS 0.0) includes two registers and can program two pages so that a two-register solid-state storage element has a page size of 4 kB. A single bank 214 a of 20 solid-state storage elements 216 a-m would then have an 80 kB capacity of pages accessed with the same address going out of the storage I/O bus 210.
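
The capacity arithmetic of this example can be restated as a small C calculation; the figures simply mirror the example in the text and carry no additional meaning.

```c
#include <stdio.h>

int main(void)
{
    unsigned page_kb   = 2;                          /* one register page, 2 kB   */
    unsigned registers = 2;                          /* two-register element      */
    unsigned elements  = 20;                         /* storage elements per bank */

    unsigned element_page = page_kb * registers;     /* 4 kB per storage element  */
    unsigned logical_page = element_page * elements; /* 80 kB per bank            */

    printf("element page: %u kB, logical page: %u kB\n",
           element_page, logical_page);
    return 0;
}
```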

This group of pages in a bank 214 of solid-state storage elements 216, 218, 220 of 80 kB may be called a logical or virtual page. Similarly, an erase block of each storage element 216 a-m of a bank 214 a may be grouped to form a logical erase block. In one embodiment, erasing a logical erase block causes a physical erase block of each storage element 216 a-m of a bank 214 a to be erased. In one embodiment, an erase block of pages within a solid-state storage element 216, 218, 220 is erased when an erase command is received within a solid-state storage element 216, 218, 220. In another embodiment, a single physical erase block on each storage element (e.g. SSS M.N) collectively forms a logical erase block for the solid-state storage media 110 a. In such an embodiment, erasing a logical erase block comprises erasing an erase block at the same address within each storage element (e.g. SSS M.N) in the solid-state storage array 110 a. Whereas the size and number of erase blocks, pages, planes, or other logical and physical divisions within a solid-state storage element 216, 218, 220 may change over time with advancements in technology, it is to be expected that many embodiments consistent with new configurations are possible and are consistent with the general description herein.

In one embodiment, data is written in packets to the storage elements. The solid-state controller 104 uses the storage I/O bus 210 and storage control bus 212 to address a particular bank 214, storage element 216, 218, 220, physical erase block, physical page, and optionally offset within a physical page for writing the data packet. In one embodiment, the solid-state controller 104 sends the address information for the data packet by way of the storage I/O bus 210 and signals that the data on the storage I/O bus 210 is address data by way of particular signals set on the storage control bus 212. The solid-state controller 104 follows the transmission of the address information with transmission of the data packet of data that is to be stored. The physical address contains enough information for the solid-state storage element 216, 218, 220 to direct the data packet to the designated location within the page.

In one embodiment, the storage I/O bus 210 a.a connects to each storage element in a row of storage elements (e.g. SSS 0.0-SSS 0.N 216 a, 218 a, 220 a). In such an embodiment, the solid-state controller 104 a activates a desired bank 214 a using the storage control bus 212 a, such that data on storage I/O bus 210 a.a reaches the proper page of a single storage element (e.g. SSS 0.0 216 a).

In addition, in certain embodiments, the solid-state controller 104 a simultaneously activates the same bank 214 a using the storage control bus 212 a, such that different data (a different data packet) on storage I/O bus 210 a.b reaches the proper page of a single storage element on another row (e.g. SSS 1.0 216 b). In this manner, multiple physical pages of multiple storage elements 216, 218, 220 may be written to simultaneously within a single bank 214 to store a logical page.

Similarly, a read command may require a command on the storage control bus 212 to select a single bank 214 a and the appropriate page within that bank 214 a. In one embodiment, a read command reads an entire physical page from each storage element, and because there are multiple solid-state storage elements 216, 218, 220 in parallel in a bank 214, an entire logical page is read with a read command. However, the read command may be broken into subcommands, as will be explained below with respect to bank interleave. A logical page may also be accessed in a write operation.

In one embodiment, a solid-state controller 104 may send an erase block erase command over all the lines of the storage I/O bus 210 to erase a physical erase block having a particular erase block address. In addition, the solid-state controller 104 may simultaneously activate a single bank 214 using the storage control bus 212 such that each physical erase block in the single activated bank 214 is erased as part of a logical erase block.

In another embodiment, the solid-state controller 104 may send an erase block erase command over all the lines of the storage I/O bus 210 to erase a physical erase block having a particular erase block address on each storage element 216, 218, 220 (SSS 0.0-SSS M.N). These particular physical erase blocks together may form a logical erase block. Once the address of the physical erase blocks is provided to the storage elements 216, 218, 220, the solid-state controller 104 may initiate the erase command on a bank 214 a by bank 214 b by bank 214 n basis (either in order or based on some other sequence). Other commands may also be sent to a particular location using a combination of the storage I/O bus 210 and the storage control bus 212. One of skill in the art will recognize other ways to select a particular storage location using the bi-directional storage I/O bus 210 and the storage control bus 212.

In one embodiment, the storage controller 104 sequentially writes data on the solid-state storage media 110 in a log structured format and within one or more physical structures of the storage elements, the data is sequentially stored on the solid-state storage media 110. Sequentially writing data involves the storage controller 104 streaming data packets into storage write buffers for storage elements, such as a chip (a package of one or more dies) or a die on a circuit board. When the storage write buffers are full, the data packets are programmed to a designated virtual or logical page (“LP”). Data packets then refill the storage write buffers and, when full, the data packets are written to the next LP. The next virtual page may be in the same bank 214 a or another bank (e.g. 214 b). This process continues, LP after LP, typically until a virtual or logical erase block (“LEB”) is filled. LPs and LEBs are described in more detail below.
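
A hedged sketch of this sequential, log-structured write path follows; the logical page size, the buffering scheme, and the program_logical_page() call are placeholders chosen for illustration and are not taken from the specification.

```c
#include <stdint.h>
#include <string.h>

#define LP_SIZE 4096          /* illustrative logical page size, not from the text */

/* Hypothetical call that programs one full logical page at the log head. */
extern void program_logical_page(const uint8_t *page, uint64_t lp_index);

static uint8_t  write_buf[LP_SIZE];
static uint64_t buf_fill;     /* bytes buffered so far   */
static uint64_t next_lp;      /* next logical page index */

/*
 * Packets are streamed into the write buffer; whenever the buffer holds a
 * full logical page it is programmed and the log head advances to the next
 * logical page, LP after LP.
 */
void append_packet(const uint8_t *packet, uint64_t len)
{
    while (len > 0) {
        uint64_t room = LP_SIZE - buf_fill;
        uint64_t take = len < room ? len : room;

        memcpy(write_buf + buf_fill, packet, take);
        buf_fill += take;
        packet   += take;
        len      -= take;

        if (buf_fill == LP_SIZE) {
            program_logical_page(write_buf, next_lp++);
            buf_fill = 0;
        }
    }
}
```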

In another embodiment, the streaming may continue across LEB boundaries with the process continuing, LEB after LEB. Typically, the storage controller 104 sequentially stores data packets in an LEB by order of processing. In one embodiment, where a write data pipeline 106 is used, the storage controller 104 stores packets in the order that they come out of the write data pipeline 106. This order may be a result of data segments arriving from a requesting device mixed with packets of valid data that are being read from another storage location as valid data is being recovered from another LEB during a recovery operation.

The sequentially stored data, in one embodiment, can serve as a log to reconstruct data indexes and other metadata using information from data packet headers. For example, in one embodiment, the storage controller 104 may reconstruct a storage index by reading headers to determine the data structure to which each packet belongs and sequence information to determine where in the data structure the data or metadata belongs. The storage controller 104, in one embodiment, uses physical address information for each packet and timestamp or sequence information to create a mapping between the physical locations of the packets and the data structure identifier and data segment sequence. Timestamp or sequence information is used by the storage controller 104 to replay the sequence of changes made to the index and thereby reestablish the most recent state.
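
The replay idea can be sketched as follows; the packet header layout and the map_update() call are assumptions introduced for illustration, not the format actually used by the storage controller 104.

```c
#include <stdint.h>

/* Hypothetical on-media packet header used only for this sketch. */
struct packet_header {
    uint64_t logical_id;   /* data structure / logical block identifier */
    uint64_t sequence;     /* monotonically increasing sequence number  */
    uint32_t length;       /* payload length in bytes                   */
};

extern void map_update(uint64_t logical_id, uint64_t physical_addr);

/*
 * Replaying packets in the order they were written rebuilds the
 * logical-to-physical index: because the caller supplies packets in log
 * order, a later packet simply overwrites the mapping installed by an
 * earlier one, so the final index reflects the most recent write to each
 * logical identifier.
 */
void replay_index(const struct packet_header *hdrs,
                  const uint64_t *physical_addrs, uint64_t npackets)
{
    for (uint64_t i = 0; i < npackets; i++)
        map_update(hdrs[i].logical_id, physical_addrs[i]);
}
```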

In one embodiment, erase blocks are time stamped or given a sequence number as packets are written and the timestamp or sequence information of an erase block is used along with information gathered from container headers and packet headers to reconstruct the storage index. In another embodiment, timestamp or sequence information is written to an erase block when the erase block is recovered.

In a read, modify, write operation, data packets associated with the logical structure are located and read in a read operation. Data segments of the modified structure that have been modified are not written to the location from which they are read. Instead, the modified data segments are again converted to data packets and then written to the next available location in the virtual page currently being written. Index entries for the respective data packets are modified to point to the packets that contain the modified data segments. The entry or entries in the index for data packets associated with the same logical structure that have not been modified will include pointers to the original location of the unmodified data packets. Thus, if the original logical structure is maintained, for example to maintain a previous version of the logical structure, the original logical structure will have pointers in the index to all data packets as originally written. The new logical structure will have pointers in the index to some of the original data packets and pointers to the modified data packets in the virtual page that is currently being written.
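
A compact sketch of the index bookkeeping for such a read, modify, write operation; the index entry type and append_packet_to_log() helper are hypothetical names used only to make the step concrete.

```c
#include <stdint.h>

/* Hypothetical index entry: one logical block -> one physical packet. */
struct index_entry {
    uint64_t physical_addr;
};

/* Hypothetical append at the current log head; returns the new physical address. */
extern uint64_t append_packet_to_log(const void *data, uint64_t len);

/*
 * The modified segment is never written back in place.  It is appended at
 * the current log head and only then is the index entry repointed, so
 * unmodified entries keep pointing at the original packets.
 */
void rewrite_segment(struct index_entry *entry, const void *new_data, uint64_t len)
{
    uint64_t new_addr = append_packet_to_log(new_data, len);
    entry->physical_addr = new_addr;   /* old packet becomes recoverable garbage */
}
```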

In a copy operation, the index includes an entry for the original logical structure mapped to a number of packets stored on the solid-state storage media 110. When a copy is made, a new logical structure is created and a new entry is created in the index mapping the new logical structure to the original packets. The new logical structure is also written to the solid-state storage media 110 with its location mapped to the new entry in the index. The new logical structure packets may be used to identify the packets within the original logical structure that are referenced in case changes have been made in the original logical structure that have not been propagated to the copy and the index is lost or corrupted. In another embodiment, the index includes a logical entry for a logical block.

Beneficially, sequentially writing packets facilitates a more even use of the solid-state storage media 110 and allows the solid-state storage device controller 202 to monitor storage hot spots and level usage of the various virtual pages in the solid-state storage media 110. Sequentially writing packets also facilitates a powerful, efficient garbage collection system, which is described in detail below. One of skill in the art will recognize other benefits of sequential storage of data packets.

The system 100 may comprise a log-structured storage system or log-structured array similar to a log-structured file system, and the order that data is stored may be used to recreate an index. Typically, an index that includes a logical-to-physical mapping is stored in volatile memory. The index is referred to as a logical-to-physical map herein. If the index is corrupted or lost, the index may be reconstructed by addressing the solid-state storage media 110 in the order that the data was written. Within a logical erase block (“LEB”), data is typically stored sequentially by filling a first logical page, then a second logical page, etc. until the LEB is filled. The solid-state storage controller 104 then chooses another LEB and the process repeats. By maintaining an order that the LEBs were written to and by knowing that each LEB is written sequentially, the index can be rebuilt by traversing the solid-state storage media 110 in order from beginning to end. In other embodiments, if part of the index is stored in non-volatile memory, such as on the solid-state storage media 110, the solid-state storage controller 104 may only need to replay a portion of the solid-state storage media 110 to rebuild a portion of the index that was not stored in non-volatile memory. One of skill in the art will recognize other benefits of sequential storage of data packets.

Solid-State Storage Device Controller

In various embodiments, the solid-state storage device controller 202also includes a data bus 204, a local bus 206, a buffer controller 208,buffers 0-N 222 a-n, a master controller 224, a direct memory access(“DMA”) controller 226, a memory controller 228, a dynamic memory array230, a static random memory array 232, a management controller 234, amanagement bus 236, a bridge 238 to a system bus 240, and miscellaneouslogic 242, which are described below. In other embodiments, the systembus 240 is coupled to one or more network interface cards (“NICs”) 244,some of which may include remote DMA (“RDMA”) controllers 246, one ormore central processing unit (“CPU”) 248, one or more external memorycontrollers 250 and associated external memory arrays 252, one or morestorage controllers 254, peer controllers 256, and application specificprocessors 258, which are described below. The components 244-258connected to the system bus 240 may be located in the computer 112 ormay be other devices.

In one embodiment, the solid-state storage controller(s) 104 communicatedata to the solid-state storage media 110 over a storage I/O bus 210. Ina certain embodiment where the solid-state storage is arranged in banks214 and each bank 214 includes multiple storage elements 216, 218, 220accessible in parallel, the storage I/O bus 210 comprises an array ofbusses, one for each row of storage elements 216, 218, 220 spanning thebanks 214. As used herein, the term “storage I/O bus” may refer to onestorage I/O bus 210 or an array of data independent busses 204. In oneembodiment, each storage I/O bus 210 accessing a row of storage elements(e.g. 216 a, 218 a, 220 a) may include a logical-to-physical mapping forstorage divisions (e.g. erase blocks) accessed in a row of storageelements 216 a, 218 a, 220 a. This mapping allows a logical addressmapped to a physical address of a storage division to be remapped to adifferent storage division if the first storage division fails,partially fails, is inaccessible, or has some other problem. Remappingis explained further in relation to the remapping module 430 of FIG. 4.

Data may also be communicated to the solid-state storage controller(s) 104 from a requesting device 155 through the system bus 240, bridge 238, local bus 206, buffer(s) 222, and finally over a data bus 204. The data bus 204 typically is connected to one or more buffers 222 a-n controlled with a buffer controller 208. The buffer controller 208 typically controls transfer of data from the local bus 206 to the buffers 222 and through the data bus 204 to the pipeline input buffer 306 and output buffer 330. The buffer controller 208 typically controls how data arriving from a requesting device 155 can be temporarily stored in a buffer 222 and then transferred onto a data bus 204, or vice versa, to account for different clock domains, to prevent data collisions, etc. The buffer controller 208 typically works in conjunction with the master controller 224 to coordinate data flow. As data arrives, the data will arrive on the system bus 240 and be transferred to the local bus 206 through a bridge 238.

Typically the data is transferred from the local bus 206 to one or moredata buffers 222 as directed by the master controller 224 and the buffercontroller 208. The data then flows out of the buffer(s) 222 to the databus 204, through a solid-state controller 104, and on to the solid-statestorage media 110 such as NAND flash or other storage media. In oneembodiment, data and associated out-of-band metadata (“metadata”)arriving with the data is communicated using one or more data channelscomprising one or more solid-state storage controllers 104 a-104 n-1 andassociated solid-state storage media 110 a-110 n-1 while at least onechannel (solid-state storage controller 104 n, solid-state storage media110 n) is dedicated to in-band metadata, such as index information andother metadata generated internally to the solid-state storage device102.

The local bus 206 is typically a bidirectional bus or set of busses thatallows for communication of data and commands between devices internalto the solid-state storage device controller 202 and between devicesinternal to the solid-state storage device 102 and devices 244-258connected to the system bus 240. The bridge 238 facilitatescommunication between the local bus 206 and system bus 240. One of skillin the art will recognize other embodiments such as ring structures orswitched star configurations and functions of buses 240, 206, 204 andbridges 238.

The system bus 240 is typically a bus of a computer 112 or other devicein which the solid-state storage device 102 is installed or connected.In one embodiment, the system bus 240 may be a PCI-e bus, a SerialAdvanced Technology Attachment (“serial ATA”) bus, parallel ATA, or thelike. In another embodiment, the system bus 240 is an external bus suchas small computer system interface (“SCSI”), FireWire, Fiber Channel,USB, PCIe-AS, or the like. The solid-state storage device 102 may bepackaged to fit internally to a device or as an externally connecteddevice.

The solid-state storage device controller 202 includes a master controller 224 that controls higher-level functions within the solid-state storage device 102. The master controller 224, in various embodiments, controls data flow by interpreting requests, directing creation of indexes to map identifiers associated with data to physical locations of associated data, coordinating DMA requests, etc. Many of the functions described herein are controlled wholly or in part by the master controller 224.

In one embodiment, the master controller 224 uses embeddedcontroller(s). In another embodiment, the master controller 224 useslocal memory such as a dynamic memory array 230 (dynamic random accessmemory “DRAM”), a static memory array 232 (static random access memory“SRAM”), etc. In one embodiment, the local memory is controlled usingthe master controller 224. In another embodiment, the master controller224 accesses the local memory via a memory controller 228. In anotherembodiment, the master controller 224 runs a Linux server and maysupport various common server interfaces, such as the World Wide Web,hyper-text markup language (“HTML”), etc. In another embodiment, themaster controller 224 uses a nano-processor. The master controller 224may be constructed using programmable or standard logic, or anycombination of controller types listed above. The master controller 224may be embodied as hardware, as software, or as a combination ofhardware and software. One skilled in the art will recognize manyembodiments for the master controller 224.

In one embodiment, where the storage controller 152/solid-state storagedevice controller 202 manages multiple data storage devices/solid-statestorage media 110 a-n, the master controller 224 divides the work loadamong internal controllers, such as the solid-state storage controllers104 a-n. For example, the master controller 224 may divide a datastructure to be written to the data storage devices (e.g. solid-statestorage media 110 a-n) so that a portion of the data structure is storedon each of the attached data storage devices. This feature is aperformance enhancement allowing quicker storage and access to a datastructure. In one embodiment, the master controller 224 is implementedusing an FPGA. In another embodiment, the firmware within the mastercontroller 224 may be updated through the management bus 236, the systembus 240 over a network connected to a NIC 244 or other device connectedto the system bus 240.

In one embodiment, the master controller 224 emulates block storage such that a computer 112 or other device connected to the storage device/solid-state storage device 102 views the storage device/solid-state storage device 102 as a block storage device and sends data to specific physical addresses in the storage device/solid-state storage device 102. The master controller 224 then divides up the blocks and stores the data blocks. The master controller 224 then maps the blocks and physical address sent with the block to the actual locations determined by the master controller 224. The mapping is stored in the index. Typically, for block emulation, a block device application program interface ("API") is provided in a driver in the computer 112, client 114, or other device wishing to use the storage device/solid-state storage device 102 as a block storage device.
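
For illustration only, the following sketch shows the block-emulation idea in outline: the host addresses fixed-size logical blocks, the controller appends each block wherever it chooses, and the logical-to-physical association is recorded in an index. The interface shown is an assumption for the example, not the actual API of the master controller 224.

    # Minimal sketch of block emulation over an append-anywhere medium.
    class BlockEmulator:
        BLOCK_SIZE = 4096

        def __init__(self):
            self.media = []        # append-only "physical" storage
            self.index = {}        # logical block address -> physical location

        def write_block(self, lba, data):
            assert len(data) == self.BLOCK_SIZE
            self.media.append(data)                 # actual location chosen by the controller
            self.index[lba] = len(self.media) - 1   # mapping stored in the index

        def read_block(self, lba):
            return self.media[self.index[lba]]

    dev = BlockEmulator()
    dev.write_block(10, b"\x00" * BlockEmulator.BLOCK_SIZE)
    assert dev.read_block(10) == b"\x00" * 4096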

In another embodiment, the master controller 224 coordinates with NICcontrollers 244 and embedded RDMA controllers 246 to deliverjust-in-time RDMA transfers of data and command sets. NIC controller 244may be hidden behind a non-transparent port to enable the use of customdrivers. Also, a driver on a client 114 may have access to the computernetwork 116 through an I/O memory driver using a standard stack API andoperating in conjunction with NICs 244.

In one embodiment, the master controller 224 is also a redundant array of independent drives ("RAID") controller. Where the data storage device/solid-state storage device 102 is networked with one or more other data storage devices/solid-state storage devices 102, the master controller 224 may be a RAID controller for single tier RAID, multi-tier RAID, progressive RAID, etc. The master controller 224 may also allow some objects and other data structures to be stored in a RAID array and other data structures to be stored without RAID. In another embodiment, the master controller 224 may be a distributed RAID controller element. In another embodiment, the master controller 224 may comprise many RAID, distributed RAID, and other functions as described elsewhere.

In one embodiment, the master controller 224 coordinates with single orredundant network managers (e.g. switches) to establish routing, tobalance bandwidth utilization, failover, etc. In another embodiment, themaster controller 224 coordinates with integrated application specificlogic (via local bus 206) and associated driver software. In anotherembodiment, the master controller 224 coordinates with attachedapplication specific processors 258 or logic (via the external systembus 240) and associated driver software. In another embodiment, themaster controller 224 coordinates with remote application specific logic(via the computer network 116) and associated driver software. Inanother embodiment, the master controller 224 coordinates with the localbus 206 or external bus attached hard disk drive (“HDD”) storagecontroller.

In one embodiment, the master controller 224 communicates with one ormore storage controllers 254 where the storage device/solid-statestorage device 102 may appear as a storage device connected through aSCSI bus, Internet SCSI (“iSCSI”), fiber channel, etc. Meanwhile thestorage device/solid-state storage device 102 may autonomously manageobjects or other data structures and may appear as an object file systemor distributed object file system. The master controller 224 may also beaccessed by peer controllers 256 and/or application specific processors258.

In another embodiment, the master controller 224 coordinates with anautonomous integrated management controller to periodically validateFPGA code and/or controller software, validate FPGA code while running(reset) and/or validate controller software during power on (reset),support external reset requests, support reset requests due to watchdogtimeouts, and support voltage, current, power, temperature, and otherenvironmental measurements and setting of threshold interrupts. Inanother embodiment, the master controller 224 manages garbage collectionto free erase blocks for reuse. In another embodiment, the mastercontroller 224 manages wear leveling. In another embodiment, the mastercontroller 224 allows the data storage device/solid-state storage device102 to be partitioned into multiple virtual devices and allowspartition-based media encryption. In yet another embodiment, the mastercontroller 224 supports a solid-state storage controller 104 withadvanced, multi-bit ECC correction. One of skill in the art willrecognize other features and functions of a master controller 224 in astorage controller 152, or more specifically in a solid-state storagedevice 102.

In one embodiment, the solid-state storage device controller 202 includes a memory controller 228 which controls a dynamic random memory array 230 and/or a static random memory array 232. As stated above, the memory controller 228 may be independent or integrated with the master controller 224. The memory controller 228 typically controls volatile memory of some type, such as DRAM (dynamic random memory array 230) and SRAM (static random memory array 232). In other examples, the memory controller 228 also controls other memory types such as electrically erasable programmable read only memory ("EEPROM"), etc. In other embodiments, the memory controller 228 controls two or more memory types and the memory controller 228 may include more than one controller. Typically, the memory controller 228 controls as much SRAM 232 as is feasible and uses DRAM 230 to supplement the SRAM 232.

In one embodiment, the logical-to-physical index is stored in memory230, 232 and then periodically off-loaded to a channel of thesolid-state storage media 110 n or other non-volatile memory. One ofskill in the art will recognize other uses and configurations of thememory controller 228, dynamic memory array 230, and static memory array232.

In one embodiment, the solid-state storage device controller 202includes a DMA controller 226 that controls DMA operations between thestorage device/solid-state storage device 102 and one or more externalmemory controllers 250 and associated external memory arrays 252 andCPUs 248. Note that the external memory controllers 250 and externalmemory arrays 252 are called external because they are external to thestorage device/solid-state storage device 102. In addition the DMAcontroller 226 may also control RDMA operations with requesting devicesthrough a NIC 244 and associated RDMA controller 246.

In one embodiment, the solid-state storage device controller 202 includes a management controller 234 connected to a management bus 236. Typically the management controller 234 manages environmental metrics and status of the storage device/solid-state storage device 102. The management controller 234 may monitor device temperature, fan speed, power supply settings, etc. over the management bus 236. The management controller 234 may support the reading and programming of electrically erasable programmable read only memory ("EEPROM") for storage of FPGA code and controller software. Typically the management bus 236 is connected to the various components within the storage device/solid-state storage device 102. The management controller 234 may communicate alerts, interrupts, etc. over the local bus 206 or may include a separate connection to a system bus 240 or other bus. In one embodiment the management bus 236 is an Inter-Integrated Circuit ("I²C") bus. One of skill in the art will recognize other related functions and uses of a management controller 234 connected to components of the storage device/solid-state storage device 102 by a management bus 236.

In one embodiment, the solid-state storage device controller 202includes miscellaneous logic 242 that may be customized for a specificapplication. Typically where the solid-state device controller 202 ormaster controller 224 is/are configured using a FPGA or otherconfigurable controller, custom logic may be included based on aparticular application, customer requirement, storage requirement, etc.

Data Pipeline

FIG. 3 is a schematic block diagram illustrating one embodiment 300 of asolid-state storage controller 104 with a write data pipeline 106 and aread data pipeline 108 in a solid-state storage device 102 in accordancewith the present invention. The embodiment 300 includes a data bus 204,a local bus 206, and buffer control 208, which are substantially similarto those described in relation to the solid-state storage devicecontroller 202 of FIG. 2. The write data pipeline 106 includes apacketizer 302 and an error-correcting code (“ECC”) generator 304. Inother embodiments, the write data pipeline 106 includes an input buffer306, a write synchronization buffer 308, a write program module 310, acompression module 312, an encryption module 314, a garbage collectorbypass 316 (with a portion within the read data pipeline 108), a mediaencryption module 318, and a write buffer 320. The read data pipeline108 includes a read synchronization buffer 328, an ECC correction module322, a depacketizer 324, an alignment module 326, and an output buffer330. In other embodiments, the read data pipeline 108 may include amedia decryption module 332, a portion of the garbage collector bypass316, a decryption module 334, a decompression module 336, and a readprogram module 338. The solid-state storage controller 104 may alsoinclude control and status registers 340 and control queues 342, a bankinterleave controller 344, a synchronization buffer 346, a storage buscontroller 348, and a multiplexer (“MUX”) 350. The components of thesolid-state controller 104 and associated write data pipeline 106 andread data pipeline 108 are described below. In other embodiments,synchronous solid-state storage media 110 may be used andsynchronization buffers 308 328 may be eliminated.

Write Data Pipeline

The write data pipeline 106 includes a packetizer 302 that receives adata or metadata segment to be written to the solid-state storage,either directly or indirectly through another write data pipeline 106stage, and creates one or more packets sized for the solid-state storagemedia 110. The data or metadata segment is typically part of a datastructure such as an object, but may also include an entire datastructure. In another embodiment, the data segment is part of a block ofdata, but may also include an entire block of data. Typically, a set ofdata such as a data structure is received from a computer 112, client114, or other computer or device and is transmitted to the solid-statestorage device 102 in data segments streamed to the solid-state storagedevice 102 or computer 112. A data segment may also be known by anothername, such as data parcel, but as referenced herein includes all or aportion of a data structure or data block.

Each data structure is stored as one or more packets. Each datastructure may have one or more container packets. Each packet contains aheader. The header may include a header type field. Type fields mayinclude data, attribute, metadata, data segment delimiters(multi-packet), data structures, data linkages, and the like. The headermay also include information regarding the size of the packet, such asthe number of bytes of data included in the packet. The length of thepacket may be established by the packet type. The header may includeinformation that establishes the relationship of the packet to a datastructure. An example might be the use of an offset in a data packetheader to identify the location of the data segment within the datastructure. One of skill in the art will recognize other information thatmay be included in a header added to data by a packetizer 302 and otherinformation that may be added to a data packet.

Each packet includes a header and possibly data from the data ormetadata segment. The header of each packet includes pertinentinformation to relate the packet to the data structure to which thepacket belongs. For example, the header may include an object identifieror other data structure identifier and offset that indicates the datasegment, object, data structure or data block from which the data packetwas formed. The header may also include a logical address used by thestorage bus controller 348 to store the packet. The header may alsoinclude information regarding the size of the packet, such as the numberof bytes included in the packet. The header may also include a sequencenumber that identifies where the data segment belongs with respect toother packets within the data structure when reconstructing the datasegment or data structure. The header may include a header type field.Type fields may include data, data structure attributes, metadata, datasegment delimiters (multi-packet), data structure types, data structurelinkages, and the like. One of skill in the art will recognize otherinformation that may be included in a header added to data or metadataby a packetizer 302 and other information that may be added to a packet.
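 
Purely as an illustration of the kind of header a packetizer might prepend, the sketch below packs and unpacks a handful of the fields discussed above. The specific fields, widths, and byte order are assumptions made for the example and do not reflect any particular packet format of the packetizer 302.

    # Minimal sketch of a packet header with type, length, structure id, offset, sequence.
    import struct
    from dataclasses import dataclass

    HEADER_FMT = ">BIQIH"   # type, data length, structure id, offset, sequence number

    @dataclass
    class PacketHeader:
        packet_type: int       # e.g. data, attribute, metadata, delimiter
        length: int            # bytes of data carried by the packet
        structure_id: int      # identifies the data structure the packet belongs to
        offset: int            # location of the data segment within the data structure
        sequence: int          # ordering of this packet among the structure's packets

        def pack(self):
            return struct.pack(HEADER_FMT, self.packet_type, self.length,
                               self.structure_id, self.offset, self.sequence)

        @classmethod
        def unpack(cls, raw):
            return cls(*struct.unpack(HEADER_FMT, raw))

    hdr = PacketHeader(packet_type=1, length=512, structure_id=42, offset=4096, sequence=3)
    assert PacketHeader.unpack(hdr.pack()) == hdr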

The write data pipeline 106 includes an ECC generator 304 that generates one or more error-correcting codes ("ECC") for the one or more packets received from the packetizer 302. The ECC generator 304 typically uses an error correcting algorithm to generate ECC check bits which are stored with the one or more data packets. The ECC codes generated by the ECC generator 304 together with the one or more data packets associated with the ECC codes comprise an ECC chunk. The ECC data stored with the one or more data packets is used to detect and to correct errors introduced into the data through transmission and storage. In one embodiment, packets are streamed into the ECC generator 304 as un-encoded blocks of length N. A syndrome of length S is calculated, appended, and output as an encoded block of length N+S. The values of N and S are dependent upon the characteristics of the algorithm which is selected to achieve specific performance, efficiency, and robustness metrics. In one embodiment, there is no fixed relationship between the ECC blocks and the packets; the packet may comprise more than one ECC block; the ECC block may comprise more than one packet; and a first packet may end anywhere within the ECC block and a second packet may begin after the end of the first packet within the same ECC block. In one embodiment, ECC algorithms are not dynamically modified. In one embodiment, the ECC data stored with the data packets is robust enough to correct errors in more than two bits.
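
The following sketch illustrates only the N-to-N+S framing of an ECC block. A real ECC generator 304 would append a correcting code such as BCH or Reed-Solomon; here a CRC-32 stands in as a placeholder "syndrome" purely to show the encoded-block layout, since a CRC can detect but not correct errors, and the values of N and S are arbitrary assumptions.

    # Minimal sketch of the un-encoded block (N) plus appended syndrome (S) framing.
    import zlib

    N = 512          # un-encoded block length in bytes (assumed value)
    S = 4            # syndrome length appended to each block (assumed value)

    def encode_block(block: bytes) -> bytes:
        assert len(block) == N
        syndrome = zlib.crc32(block).to_bytes(S, "big")
        return block + syndrome                      # encoded length is N + S

    def check_block(encoded: bytes) -> bool:
        block, syndrome = encoded[:N], encoded[N:]
        return zlib.crc32(block).to_bytes(S, "big") == syndrome

    chunk = encode_block(b"\xab" * N)
    assert len(chunk) == N + S and check_block(chunk)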

Beneficially, using a robust ECC algorithm allowing more than single bit correction or even double bit correction allows the life of the solid-state storage media 110 to be extended. For example, if flash memory is used as the storage medium in the solid-state storage media 110, the flash memory may be written approximately 100,000 times without error per erase cycle. This usage limit may be extended using a robust ECC algorithm. Because the ECC generator 304 and corresponding ECC correction module 322 are onboard the solid-state storage device 102, the solid-state storage device 102 can internally correct errors and has a longer useful life than if a less robust ECC algorithm is used, such as single bit correction. However, in other embodiments the ECC generator 304 may use a less robust algorithm and may correct single-bit or double-bit errors. In another embodiment, the solid-state storage media 110 may comprise less reliable storage such as multi-level cell ("MLC") flash in order to increase capacity, which storage may not be sufficiently reliable without more robust ECC algorithms.

In one embodiment, the write pipeline 106 includes an input buffer 306that receives a data segment to be written to the solid-state storagemedia 110 and stores the incoming data segments until the next stage ofthe write data pipeline 106, such as the packetizer 302 (or other stagefor a more complex write data pipeline 106) is ready to process the nextdata segment. The input buffer 306 typically allows for discrepanciesbetween the rate data segments are received and processed by the writedata pipeline 106 using an appropriately sized data buffer. The inputbuffer 306 also allows the data bus 204 to transfer data to the writedata pipeline 106 at rates greater than can be sustained by the writedata pipeline 106 in order to improve efficiency of operation of thedata bus 204. Typically when the write data pipeline 106 does notinclude an input buffer 306, a buffering function is performedelsewhere, such as in the solid-state storage device 102 but outside thewrite data pipeline 106, in the computer 112, such as within a networkinterface card (“NIC”), or at another device, for example when usingremote direct memory access (“RDMA”).

In another embodiment, the write data pipeline 106 also includes a writesynchronization buffer 308 that buffers packets received from the ECCgenerator 304 prior to writing the packets to the solid-state storagemedia 110. The write synch buffer 308 is located at a boundary between alocal clock domain and a solid-state storage clock domain and providesbuffering to account for the clock domain differences. In otherembodiments, synchronous solid-state storage media 110 may be used andsynchronization buffers 308 328 may be eliminated.

In one embodiment, the write data pipeline 106 also includes a mediaencryption module 318 that receives the one or more packets from thepacketizer 302, either directly or indirectly, and encrypts the one ormore packets using an encryption key unique to the solid-state storagedevice 102 prior to sending the packets to the ECC generator 304.Typically, the entire packet is encrypted, including the headers. Inanother embodiment, headers are not encrypted. In this document,encryption key is understood to mean a secret encryption key that ismanaged externally from a solid-state storage controller 104.

The media encryption module 318 and corresponding media decryptionmodule 332 provide a level of security for data stored in thesolid-state storage media 110. For example, where data is encrypted withthe media encryption module 318, if the solid-state storage media 110 isconnected to a different solid-state storage controller 104, solid-statestorage device 102, or server, the contents of the solid-state storagemedia 110 typically could not be read without use of the same encryptionkey used during the write of the data to the solid-state storage media110 without significant effort.

In a typical embodiment, the solid-state storage device 102 does notstore the encryption key in non-volatile storage and allows no externalaccess to the encryption key. The encryption key is provided to thesolid-state storage controller 104 during initialization. Thesolid-state storage device 102 may use and store a non-secretcryptographic nonce that is used in conjunction with an encryption key.A different nonce may be stored with every packet. Data segments may besplit between multiple packets with unique nonces for the purpose ofimproving protection by the encryption algorithm.
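
As a hedged illustration of per-packet, non-secret nonces used with a device-unique key, the sketch below encrypts each packet with a fresh nonce stored alongside the ciphertext. The choice of AES-GCM, a 12-byte nonce, and the third-party cryptography package are assumptions made for the example; the embodiments herein do not prescribe a particular cipher.

    # Minimal sketch of media encryption with a per-packet non-secret nonce
    # (requires the "cryptography" package).
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    device_key = AESGCM.generate_key(bit_length=256)   # supplied at initialization, never stored on media
    cipher = AESGCM(device_key)

    def encrypt_packet(plaintext: bytes) -> bytes:
        nonce = os.urandom(12)                                   # non-secret, different for every packet
        return nonce + cipher.encrypt(nonce, plaintext, None)    # nonce stored with the packet

    def decrypt_packet(stored: bytes) -> bytes:
        nonce, ciphertext = stored[:12], stored[12:]
        return cipher.decrypt(nonce, ciphertext, None)

    packet = encrypt_packet(b"header+data")
    assert decrypt_packet(packet) == b"header+data"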

The encryption key may be received from a client 114, a server, keymanager, or other device that manages the encryption key to be used bythe solid-state storage controller 104. In another embodiment, thesolid-state storage media 110 may have two or more partitions and thesolid-state storage controller 104 behaves as though it was two or moresolid-state storage controllers 104, each operating on a singlepartition within the solid-state storage media 110. In this embodiment,a unique media encryption key may be used with each partition.

In another embodiment, the write data pipeline 106 also includes an encryption module 314 that encrypts a data or metadata segment received from the input buffer 306, either directly or indirectly, prior to sending the data segment to the packetizer 302, the data segment encrypted using an encryption key received in conjunction with the data segment. The encryption keys used by the encryption module 314 to encrypt data may not be common to all data stored within the solid-state storage device 102 but may vary on a per-data-structure basis and be received in conjunction with receiving data segments as described below. For example, an encryption key for a data segment to be encrypted by the encryption module 314 may be received with the data segment or may be received as part of a command to write a data structure to which the data segment belongs. The solid-state storage device 102 may use and store a non-secret cryptographic nonce in each data structure packet that is used in conjunction with the encryption key. A different nonce may be stored with every packet. Data segments may be split between multiple packets with unique nonces for the purpose of improving protection by the encryption algorithm.

The encryption key may be received from a client 114, a computer 112,key manager, or other device that holds the encryption key to be used toencrypt the data segment. In one embodiment, encryption keys aretransferred to the solid-state storage controller 104 from one of asolid-state storage device 102, computer 112, client 114, or otherexternal agent which has the ability to execute industry standardmethods to securely transfer and protect private and public keys.

In one embodiment, the encryption module 314 encrypts a first packetwith a first encryption key received in conjunction with the packet andencrypts a second packet with a second encryption key received inconjunction with the second packet. In another embodiment, theencryption module 314 encrypts a first packet with a first encryptionkey received in conjunction with the packet and passes a second datapacket on to the next stage without encryption. Beneficially, theencryption module 314 included in the write data pipeline 106 of thesolid-state storage device 102 allows data structure-by-data structureor segment-by-segment data encryption without a single file system orother external system to keep track of the different encryption keysused to store corresponding data structures or data segments. Eachrequesting device 155 or related key manager independently managesencryption keys used to encrypt only the data structures or datasegments sent by the requesting device 155.

In one embodiment, the encryption module 314 may encrypt the one or morepackets using an encryption key unique to the solid-state storage device102. The encryption module 314 may perform this media encryptionindependently, or in addition to the encryption described above.Typically, the entire packet is encrypted, including the headers. Inanother embodiment, headers are not encrypted. The media encryption bythe encryption module 314 provides a level of security for data storedin the solid-state storage media 110. For example, where data isencrypted with media encryption unique to the specific solid-statestorage device 102, if the solid-state storage media 110 is connected toa different solid-state storage controller 104, solid-state storagedevice 102, or computer 112, the contents of the solid-state storagemedia 110 typically could not be read without use of the same encryptionkey used during the write of the data to the solid-state storage media110 without significant effort.

In another embodiment, the write data pipeline 106 includes a compression module 312 that compresses the data or metadata segment prior to sending the data segment to the packetizer 302. The compression module 312 typically compresses a data or metadata segment using a compression routine known to those of skill in the art to reduce the storage size of the segment. For example, if a data segment includes a string of 512 zeros, the compression module 312 may replace the 512 zeros with a code or token indicating the 512 zeros, where the code is much more compact than the space taken by the 512 zeros.

In one embodiment, the compression module 312 compresses a first segment with a first compression routine and passes along a second segment without compression. In another embodiment, the compression module 312 compresses a first segment with a first compression routine and compresses the second segment with a second compression routine. Having this flexibility within the solid-state storage device 102 is beneficial so that clients 114 or other devices writing data to the solid-state storage device 102 may each specify a compression routine or so that one can specify a compression routine while another specifies no compression. Compression routines may also be selected according to default settings on a per data structure type or data structure class basis. For example, a first data structure of a specific data structure class and data structure type may be able to override default compression routine settings, a second data structure of the same data structure class and data structure type may use the default compression routine, and a third data structure of the same data structure class and data structure type may use no compression.
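
The sketch below illustrates per-segment selection of a compression routine, including a per-class default and a caller override; the routine registry, class names, and use of zlib are assumptions made for the example.

    # Minimal sketch of per-segment compression selection with defaults and overrides.
    import zlib

    ROUTINES = {
        "deflate": (zlib.compress, zlib.decompress),
        "none":    (lambda d: d, lambda d: d),
    }
    DEFAULTS_BY_CLASS = {"log": "deflate", "database-page": "none"}

    def compress_segment(segment, structure_class, override=""):
        routine = override or DEFAULTS_BY_CLASS.get(structure_class, "none")
        return routine, ROUTINES[routine][0](segment)

    name, payload = compress_segment(b"\x00" * 512, "log")                     # class default applies
    print(name, len(payload))                                                  # deflate: far fewer than 512 bytes
    name, payload = compress_segment(b"\x00" * 512, "log", override="none")    # caller overrides the default
    print(name, len(payload))                                                  # none: 512 bytes, stored as-is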

In one embodiment, the write data pipeline 106 includes a garbagecollector bypass 316 that receives data segments from the read datapipeline 108 as part of a data bypass in a garbage collection system. Agarbage collection system typically marks packets that are no longervalid, typically because the packet is marked for deletion or has beenmodified and the modified data is stored in a different location. Atsome point, the garbage collection system determines that a particularsection of storage may be recovered. This determination may be due to alack of available storage capacity, the percentage of data marked asinvalid reaching a threshold, a consolidation of valid data, an errordetection rate for that section of storage reaching a threshold, orimproving performance based on data distribution, etc. Numerous factorsmay be considered by a garbage collection algorithm to determine when asection of storage is to be recovered.

Once a section of storage has been marked for recovery, valid packets inthe section typically must be relocated. The garbage collector bypass316 allows packets to be read into the read data pipeline 108 and thentransferred directly to the write data pipeline 106 without being routedout of the solid-state storage controller 104. In one embodiment, thegarbage collector bypass 316 is part of an autonomous garbage collectorsystem that operates within the solid-state storage device 102. Thisallows the solid-state storage device 102 to manage data so that data issystematically spread throughout the solid-state storage media 110 toimprove performance, data reliability and to avoid overuse and underuseof any one location or area of the solid-state storage media 110 and tolengthen the useful life of the solid-state storage media 110.

The garbage collector bypass 316 coordinates insertion of segments into the write data pipeline 106 with other segments being written by clients 114 or other devices. In the depicted embodiment, the garbage collector bypass 316 is before the packetizer 302 in the write data pipeline 106 and after the depacketizer 324 in the read data pipeline 108, but may also be located elsewhere in the read and write data pipelines 106, 108. The garbage collector bypass 316 may be used during a flush of the write data pipeline 106 to fill the remainder of the virtual page in order to improve the efficiency of storage within the solid-state storage media 110 and thereby reduce the frequency of garbage collection.

In one embodiment, the write data pipeline 106 includes a write buffer320 that buffers data for efficient write operations. Typically, thewrite buffer 320 includes enough capacity for packets to fill at leastone virtual page in the solid-state storage media 110. This allows awrite operation to send an entire page of data to the solid-statestorage media 110 without interruption. By sizing the write buffer 320of the write data pipeline 106 and buffers within the read data pipeline108 to be the same capacity or larger than a storage write buffer withinthe solid-state storage media 110, writing and reading data is moreefficient since a single write command may be crafted to send a fullvirtual page of data to the solid-state storage media 110 instead ofmultiple commands.

While the write buffer 320 is being filled, the solid-state storagemedia 110 may be used for other read operations. This is advantageousbecause other solid-state devices with a smaller write buffer or nowrite buffer may tie up the solid-state storage when data is written toa storage write buffer and data flowing into the storage write bufferstalls. Read operations will be blocked until the entire storage writebuffer is filled and programmed. Another approach for systems without awrite buffer or a small write buffer is to flush the storage writebuffer that is not full in order to enable reads. Again this isinefficient because multiple write/program cycles are required to fill apage.

For the depicted embodiment with a write buffer 320 sized larger than a virtual page, a single write command, which includes numerous subcommands, can then be followed by a single program command to transfer the page of data from the storage write buffer in each solid-state storage element 216, 218, 220 to the designated page within each solid-state storage element 216, 218, 220. This technique has the benefits of eliminating partial page programming, which is known to reduce data reliability and durability, and of freeing up the destination bank for reads and other commands while the buffer fills.

In one embodiment, the write buffer 320 is a ping-pong buffer where oneside of the buffer is filled and then designated for transfer at anappropriate time while the other side of the ping-pong buffer is beingfilled. In another embodiment, the write buffer 320 includes a first-infirst-out (“FIFO”) register with a capacity of more than a virtual pageof data segments. One of skill in the art will recognize other writebuffer 320 configurations that allow a virtual page of data to be storedprior to writing the data to the solid-state storage media 110.
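
For illustration, the following sketch models a ping-pong write buffer that hands off one full virtual page for programming while the other side continues to fill; the page size, class name, and callback are assumptions made for the example.

    # Minimal sketch of a ping-pong write buffer that programs full virtual pages.
    VIRTUAL_PAGE = 16 * 1024

    class PingPongWriteBuffer:
        def __init__(self, program_page):
            self.sides = [bytearray(), bytearray()]
            self.active = 0                      # side currently being filled
            self.program_page = program_page     # callback that writes one full page to media

        def add_packet(self, packet):
            side = self.sides[self.active]
            side.extend(packet)
            if len(side) >= VIRTUAL_PAGE:
                overflow = side[VIRTUAL_PAGE:]
                self.program_page(bytes(side[:VIRTUAL_PAGE]))   # one command sends a full page
                self.active = 1 - self.active                   # switch sides while that page programs
                self.sides[self.active].extend(overflow)
                side.clear()

    pages = []
    buf = PingPongWriteBuffer(pages.append)
    for _ in range(40):
        buf.add_packet(b"\xaa" * 512)
    print(len(pages))   # one full virtual page programmed; the remainder is still buffering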

In another embodiment, the write buffer 320 is sized smaller than a virtual page so that less than a page of information could be written to a storage write buffer in the solid-state storage media 110. In this embodiment, to prevent a stall in the write data pipeline 106 from holding up read operations, data that needs to be moved from one location to another as part of the garbage collection process is queued using the garbage collection system. In case of a data stall in the write data pipeline 106, the data can be fed through the garbage collector bypass 316 to the write buffer 320 and then on to the storage write buffer in the solid-state storage media 110 to fill the pages of a virtual page prior to programming the data. In this way a data stall in the write data pipeline 106 would not stall reading from the solid-state storage device 102.

In another embodiment, the write data pipeline 106 includes a writeprogram module 310 with one or more user-definable functions within thewrite data pipeline 106. The write program module 310 allows a user tocustomize the write data pipeline 106. A user may customize the writedata pipeline 106 based on a particular data requirement or application.Where the solid-state storage controller 104 is an FPGA, the user mayprogram the write data pipeline 106 with custom commands and functionsrelatively easily. A user may also use the write program module 310 toinclude custom functions with an ASIC, however, customizing an ASIC maybe more difficult than with an FPGA. The write program module 310 mayinclude buffers and bypass mechanisms to allow a first data segment toexecute in the write program module 310 while a second data segment maycontinue through the write data pipeline 106. In another embodiment, thewrite program module 310 may include a processor core that can beprogrammed through software.

Note that the write program module 310 is shown between the input buffer 306 and the compression module 312; however, the write program module 310 could be anywhere in the write data pipeline 106 and may be distributed among the various stages 302-320. In addition, there may be multiple write program modules 310 distributed among the various stages 302-320 that are programmed and operate independently. In addition, the order of the stages 302-320 may be altered. One of skill in the art will recognize workable alterations to the order of the stages 302-320 based on particular user requirements.

Read Data Pipeline

The read data pipeline 108 includes an ECC correction module 322 that determines if a data error exists in ECC blocks of a requested packet received from the solid-state storage media 110 by using ECC stored with each ECC block of the requested packet. The ECC correction module 322 then corrects any errors in the requested packet if any error exists and the errors are correctable using the ECC. For example, if the ECC can detect an error in six bits but can only correct three bit errors, the ECC correction module 322 corrects ECC blocks of the requested packet with up to three bits in error. The ECC correction module 322 corrects the bits in error by changing the bits in error to the correct one or zero state so that the requested data packet is identical to when it was written to the solid-state storage media 110 and the ECC was generated for the packet.

If the ECC correction module 322 determines that the requested packet contains more bits in error than the ECC can correct, the ECC correction module 322 cannot correct the errors in the corrupted ECC blocks of the requested packet and sends an interrupt. In one embodiment, the ECC correction module 322 sends an interrupt with a message indicating that the requested packet is in error. The message may include information that the ECC correction module 322 cannot correct the errors or the inability of the ECC correction module 322 to correct the errors may be implied. In another embodiment, the ECC correction module 322 sends the corrupted ECC blocks of the requested packet with the interrupt and/or the message.

In one embodiment, a corrupted ECC block or portion of a corrupted ECCblock of the requested packet that cannot be corrected by the ECCcorrection module 322 is read by the master controller 224, corrected,and returned to the ECC correction module 322 for further processing bythe read data pipeline 108. In one embodiment, a corrupted ECC block orportion of a corrupted ECC block of the requested packet is sent to thedevice requesting the data. The requesting device 155 may correct theECC block or replace the data using another copy, such as a backup ormirror copy, and then may use the replacement data of the requested datapacket or return it to the read data pipeline 108. The requesting device155 may use header information in the requested packet in error toidentify data required to replace the corrupted requested packet or toreplace the data structure to which the packet belongs. In anotherembodiment, the solid-state storage controller 104 stores data usingsome type of RAID and is able to recover the corrupted data. In anotherembodiment, the ECC correction module 322 sends an interrupt and/ormessage and the receiving device fails the read operation associatedwith the requested data packet. One of skill in the art will recognizeother options and actions to be taken as a result of the ECC correctionmodule 322 determining that one or more ECC blocks of the requestedpacket are corrupted and that the ECC correction module 322 cannotcorrect the errors.
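
The decision logic described above can be illustrated with the following sketch, in which errors at or below a correctable limit are repaired in place and anything beyond that limit raises an exception standing in for the interrupt; the specific limits and the exception type are assumptions made for the example.

    # Minimal sketch of the correct-or-escalate decision in an ECC correction stage.
    DETECTABLE_BITS = 6
    CORRECTABLE_BITS = 3

    class UncorrectableECCError(Exception):
        """Stands in for the interrupt sent when errors exceed the ECC's correcting power."""

    def process_ecc_block(block: bytes, error_bit_positions: list) -> bytes:
        if len(error_bit_positions) > CORRECTABLE_BITS:
            raise UncorrectableECCError(f"{len(error_bit_positions)} bits in error")
        corrected = bytearray(block)
        for bit in error_bit_positions:            # flip each erroneous bit back to its written state
            corrected[bit // 8] ^= 1 << (bit % 8)
        return bytes(corrected)

    print(process_ecc_block(b"\x00\x01", [0, 9]))       # two bit errors: corrected in place
    try:
        process_ecc_block(b"\x00" * 4, [1, 5, 9, 13])   # four bit errors: beyond the limit
    except UncorrectableECCError as exc:
        print("interrupt:", exc)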

The read data pipeline 108 includes a depacketizer 324 that receives ECC blocks of the requested packet from the ECC correction module 322, directly or indirectly, and checks and removes one or more packet headers. The depacketizer 324 may validate the packet headers by checking packet identifiers, data length, data location, etc. within the headers. In one embodiment, the header includes a hash code that can be used to validate that the packet delivered to the read data pipeline 108 is the requested packet. The depacketizer 324 also removes the headers from the requested packet added by the packetizer 302. The depacketizer 324 may be directed to not operate on certain packets but to pass these forward without modification. An example might be a container label that is requested during the course of a rebuild process where the header information is required for index reconstruction. Further examples include the transfer of packets of various types destined for use within the solid-state storage device 102. In another embodiment, the depacketizer 324 operation may be packet type dependent.

The read data pipeline 108 includes an alignment module 326 thatreceives data from the depacketizer 324 and removes unwanted data. Inone embodiment, a read command sent to the solid-state storage media 110retrieves a packet of data. A device requesting the data may not requireall data within the retrieved packet and the alignment module 326removes the unwanted data. If all data within a retrieved page isrequested data, the alignment module 326 does not remove any data.

The alignment module 326 re-formats the data as data segments of a data structure in a form compatible with a device requesting the data segment prior to forwarding the data segment to the next stage. Typically, as data is processed by the read data pipeline 108, the size of data segments or packets changes at various stages. The alignment module 326 uses received data to format the data into data segments suitable to be sent to the requesting device 155 and joined to form a response. For example, data from a portion of a first data packet may be combined with data from a portion of a second data packet. If a data segment is larger than the data requested by the requesting device 155, the alignment module 326 may discard the unwanted data.
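
As a small illustration of this trimming step, the sketch below slices a retrieved packet down to just the byte range the requesting device asked for; the parameter names are assumptions made for the example.

    # Minimal sketch of discarding unwanted bytes from a retrieved packet.
    def align(packet_data: bytes, packet_offset: int, req_offset: int, req_length: int) -> bytes:
        """packet_offset: where this packet's data starts within the data structure."""
        start = max(req_offset - packet_offset, 0)
        end = min(req_offset + req_length - packet_offset, len(packet_data))
        return packet_data[start:end]

    # A 4 KiB packet holds structure bytes 8192-12287, but only bytes 9000-9099 were requested.
    segment = align(bytes(range(256)) * 16, packet_offset=8192, req_offset=9000, req_length=100)
    print(len(segment))   # 100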

In one embodiment, the read data pipeline 108 includes a readsynchronization buffer 328 that buffers one or more requested packetsread from the solid-state storage media 110 prior to processing by theread data pipeline 108. The read synchronization buffer 328 is at theboundary between the solid-state storage clock domain and the local busclock domain and provides buffering to account for the clock domaindifferences.

In another embodiment, the read data pipeline 108 includes an outputbuffer 330 that receives requested packets from the alignment module 326and stores the packets prior to transmission to the requesting device155. The output buffer 330 accounts for differences between when datasegments are received from stages of the read data pipeline 108 and whenthe data segments are transmitted to other parts of the solid-statestorage controller 104 or to the requesting device 155. The outputbuffer 330 also allows the data bus 204 to receive data from the readdata pipeline 108 at rates greater than can be sustained by the readdata pipeline 108 in order to improve efficiency of operation of thedata bus 204.

In one embodiment, the read data pipeline 108 includes a mediadecryption module 332 that receives one or more encrypted requestedpackets from the ECC correction module 322 and decrypts the one or morerequested packets using the encryption key unique to the solid-statestorage device 102 prior to sending the one or more requested packets tothe depacketizer 324. Typically the encryption key used to decrypt databy the media decryption module 332 is identical to the encryption keyused by the media encryption module 318. In another embodiment, thesolid-state storage media 110 may have two or more partitions and thesolid-state storage controller 104 behaves as though it was two or moresolid-state storage controllers 104 each operating on a single partitionwithin the solid-state storage media 110. In this embodiment, a uniquemedia encryption key may be used with each partition.

In another embodiment, the read data pipeline 108 includes a decryptionmodule 334 that decrypts a data segment formatted by the depacketizer324 prior to sending the data segment to the output buffer 330. The datasegment may be decrypted using an encryption key received in conjunctionwith the read request that initiates retrieval of the requested packetreceived by the read synchronization buffer 328. The decryption module334 may decrypt a first packet with an encryption key received inconjunction with the read request for the first packet and then maydecrypt a second packet with a different encryption key or may pass thesecond packet on to the next stage of the read data pipeline 108 withoutdecryption. When the packet was stored with a non-secret cryptographicnonce, the nonce is used in conjunction with an encryption key todecrypt the data packet. The encryption key may be received from aclient 114, a computer 112, key manager, or other device that managesthe encryption key to be used by the solid-state storage controller 104.

In another embodiment, the read data pipeline 108 includes adecompression module 336 that decompresses a data segment formatted bythe depacketizer 324. In one embodiment, the decompression module 336uses compression information stored in one or both of the packet headerand the container label to select a complementary routine to that usedto compress the data by the compression module 312. In anotherembodiment, the decompression routine used by the decompression module336 is dictated by the device requesting the data segment beingdecompressed. In another embodiment, the decompression module 336selects a decompression routine according to default settings on a perdata structure type or data structure class basis. A first packet of afirst object may be able to override a default decompression routine anda second packet of a second data structure of the same data structureclass and data structure type may use the default decompression routineand a third packet of a third data structure of the same data structureclass and data structure type may use no decompression.

In another embodiment, the read data pipeline 108 includes a readprogram module 338 that includes one or more user-definable functionswithin the read data pipeline 108. The read program module 338 hassimilar characteristics to the write program module 310 and allows auser to provide custom functions to the read data pipeline 108. The readprogram module 338 may be located as shown in FIG. 3, may be located inanother position within the read data pipeline 108, or may includemultiple parts in multiple locations within the read data pipeline 108.Additionally, there may be multiple read program modules 338 withinmultiple locations within the read data pipeline 108 that operateindependently. One of skill in the art will recognize other forms of aread program module 338 within a read data pipeline 108. As with thewrite data pipeline 106, the stages of the read data pipeline 108 may berearranged and one of skill in the art will recognize other orders ofstages within the read data pipeline 108.

The solid-state storage controller 104 includes control and statusregisters 340 and corresponding control queues 342. The control andstatus registers 340 and control queues 342 facilitate control andsequencing commands and subcommands associated with data processed inthe write and read data pipelines 106, 108. For example, a data segmentin the packetizer 302 may have one or more corresponding controlcommands or instructions in a control queue 342 associated with the ECCgenerator 304. As the data segment is packetized, some of theinstructions or commands may be executed within the packetizer 302.Other commands or instructions may be passed to the next control queue342 through the control and status registers 340 as the newly formeddata packet created from the data segment is passed to the next stage.

Commands or instructions may be simultaneously loaded into the controlqueues 342 for a packet being forwarded to the write data pipeline 106with each pipeline stage pulling the appropriate command or instructionas the respective packet is executed by that stage. Similarly, commandsor instructions may be simultaneously loaded into the control queues 342for a packet being requested from the read data pipeline 108 with eachpipeline stage pulling the appropriate command or instruction as therespective packet is executed by that stage. One of skill in the artwill recognize other features and functions of control and statusregisters 340 and control queues 342.

The solid-state storage controller 104 and/or solid-state storage device 102 may also include a bank interleave controller 344, a synchronization buffer 346, a storage bus controller 348, and a multiplexer ("MUX") 350, which are described in relation to FIG. 4.

Bank Interleave

FIG. 4 is a schematic block diagram illustrating one embodiment 400 of abank interleave controller 344 in the solid-state storage controller 104in accordance with the present invention. The bank interleave controller344 is connected to the control and status registers 340 and to thestorage I/O bus 210 and storage control bus 212 through the MUX 350,storage bus controller 348, and synchronization buffer 346, which aredescribed below. The bank interleave controller 344 includes a readagent 402, a write agent 404, an erase agent 406, a management agent408, read queues 410 a-n, write queues 412 a-n, erase queues 414 a-n,and management queues 416 a-n for the banks 214 in the solid-statestorage media 110, bank controllers 418 a-n, a bus arbiter 420, and astatus MUX 422, which are described below. The storage bus controller348 includes a mapping module 424 with a remapping module 430, a statuscapture module 426, and a NAND bus controller 428, which are describedbelow.

The bank interleave controller 344 directs one or more commands to two or more queues in the bank interleave controller 344 and coordinates among the banks 214 of the solid-state storage media 110 execution of the commands stored in the queues, such that a command of a first type executes on one bank 214 a while a command of a second type executes on a second bank 214 b. The one or more commands are separated by command type into the queues. Each bank 214 of the solid-state storage media 110 has a corresponding set of queues within the bank interleave controller 344 and each set of queues includes a queue for each command type.

The bank interleave controller 344 coordinates among the banks 214 ofthe solid-state storage media 110 execution of the commands stored inthe queues. For example, a command of a first type executes on one bank214 a while a command of a second type executes on a second bank 214 b.Typically the command types and queue types include read and writecommands and queues 410, 412, but may also include other commands andqueues that are storage media specific. For example, in the embodimentdepicted in FIG. 4, erase and management queues 414, 416 are includedand would be appropriate for flash memory, NRAM, MRAM, DRAM, PRAM, etc.

For other types of solid-state storage media 110, other types ofcommands and corresponding queues may be included without straying fromthe scope of the invention. The flexible nature of an FPGA solid-statestorage controller 104 allows flexibility in storage media. If flashmemory were changed to another solid-state storage type, the bankinterleave controller 344, storage bus controller 348, and MUX 350 couldbe altered to accommodate the media type without significantly affectingthe data pipelines 106, 108 and other solid-state storage controller 104functions.

In the embodiment depicted in FIG. 4, the bank interleave controller 344 includes, for each bank 214, a read queue 410 for reading data from the solid-state storage media 110, a write queue 412 for write commands to the solid-state storage media 110, an erase queue 414 for erasing an erase block in the solid-state storage, and a management queue 416 for management commands. The bank interleave controller 344 also includes corresponding read, write, erase, and management agents 402, 404, 406, 408. In another embodiment, the control and status registers 340 and control queues 342 or similar components queue commands for data sent to the banks 214 of the solid-state storage media 110 without a bank interleave controller 344.

The agents 402, 404, 406, 408, in one embodiment, direct commands of theappropriate type destined for a particular bank 214 a to the correctqueue for the bank 214 a. For example, the read agent 402 may receive aread command for bank-1 214 b and directs the read command to the bank-1read queue 410 b. The write agent 404 may receive a write command towrite data to a location in bank-0 214 a of the solid-state storagemedia 110 and will then send the write command to the bank-0 write queue412 a. Similarly, the erase agent 406 may receive an erase command toerase an erase block in bank-1 214 b and will then pass the erasecommand to the bank-1 erase queue 414 b. The management agent 408typically receives management commands, status requests, and the like,such as a reset command or a request to read a configuration register ofa bank 214, such as bank-0 214 a. The management agent 408 sends themanagement command to the bank-0 management queue 416 a.
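
By way of illustration, the sketch below keeps one queue per command type per bank and routes incoming commands much as the agents 402, 404, 406, 408 are described as doing, so that commands of different types aimed at different banks can later be issued concurrently; the queue layout and command tuples are assumptions made for the example.

    # Minimal sketch of per-bank, per-command-type queues with simple routing.
    from collections import deque

    COMMAND_TYPES = ("read", "write", "erase", "manage")
    NUM_BANKS = 2

    queues = {bank: {ctype: deque() for ctype in COMMAND_TYPES} for bank in range(NUM_BANKS)}

    def route_command(ctype, bank, payload):
        """Acts like the read/write/erase/management agents: pick the queue by type and bank."""
        queues[bank][ctype].append(payload)

    route_command("write", 0, "write page of packets")
    route_command("erase", 1, "erase block 17")

    # An arbiter may now issue the bank-0 write and the bank-1 erase concurrently,
    # since they target different banks.
    for bank, per_type in queues.items():
        for ctype, q in per_type.items():
            if q:
                print(f"bank {bank}: {ctype} -> {q[0]}")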

The agents 402, 404, 406, 408 typically also monitor status of thequeues 410, 412, 414, 416 and send status, interrupt, or other messageswhen the queues 410, 412, 414, 416 are full, nearly full,non-functional, etc. In one embodiment, the agents 402, 404, 406, 408receive commands and generate corresponding sub-commands. In oneembodiment, the agents 402, 404, 406, 408 receive commands through thecontrol & status registers 340 and generate corresponding sub-commandswhich are forwarded to the queues 410, 412, 414, 416. One of skill inthe art will recognize other functions of the agents 402, 404, 406, 408.

The queues 410, 412, 414, 416 typically receive commands and store the commands until required to be sent to the solid-state storage banks 214. In a typical embodiment, the queues 410, 412, 414, 416 are first-in, first-out (“FIFO”) registers or a similar component that operates as a FIFO. In another embodiment, the queues 410, 412, 414, 416 store commands in an order that matches data, order of importance, or other criteria.
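
The following is a minimal sketch, in C, of the per-bank FIFO command queues and agent routing described above. The structure names, queue depth, and number of banks are illustrative assumptions rather than the controller's actual implementation, which in the depicted embodiments is realized in hardware.

```c
#include <stdio.h>

enum cmd_type { CMD_READ, CMD_WRITE, CMD_ERASE, CMD_MGMT, CMD_TYPES };

struct cmd { enum cmd_type type; unsigned bank; unsigned long addr; };

#define QUEUE_DEPTH 16
struct fifo {
    struct cmd slots[QUEUE_DEPTH];
    unsigned head, tail, count;
};

/* One set of queues per bank: a queue for each command type. */
#define NUM_BANKS 4
static struct fifo queues[NUM_BANKS][CMD_TYPES];

static int fifo_push(struct fifo *q, struct cmd c)
{
    if (q->count == QUEUE_DEPTH)
        return -1;                          /* queue full: an agent could signal status here */
    q->slots[q->tail] = c;
    q->tail = (q->tail + 1) % QUEUE_DEPTH;
    q->count++;
    return 0;
}

static int fifo_pop(struct fifo *q, struct cmd *out)
{
    if (q->count == 0)
        return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return 0;
}

/* An agent directs a command of its type to the queue of the destination bank. */
static int agent_direct(struct cmd c)
{
    return fifo_push(&queues[c.bank][c.type], c);
}

int main(void)
{
    struct cmd c = { CMD_WRITE, 0, 0x1000 };    /* write destined for bank-0 */
    agent_direct(c);

    struct cmd out;
    if (fifo_pop(&queues[0][CMD_WRITE], &out) == 0)
        printf("bank-%u write command at 0x%lx\n", out.bank, out.addr);
    return 0;
}
```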

The bank controllers 418 typically receive commands from the queues 410, 412, 414, 416 and generate appropriate subcommands. For example, the bank-0 write queue 412 a may receive a command to write a page of data packets to bank-0 214 a. The bank-0 controller 418 a may receive the write command at an appropriate time and may generate one or more write subcommands for each data packet stored in the write buffer 320 to be written to the page in bank-0 214 a. For example, bank-0 controller 418 a may generate commands to validate the status of bank 0 214 a and the solid-state storage array 216, select the appropriate location for writing one or more data packets, clear the input buffers within the solid-state storage memory array 216, transfer the one or more data packets to the input buffers, program the input buffers into the selected location, verify that the data was correctly programmed, and if program failures occur do one or more of interrupting the master controller 224, retrying the write to the same physical location, and retrying the write to a different physical location. Additionally, in conjunction with the example write command, the storage bus controller 348 will cause the one or more commands to be multiplied to each of the storage I/O buses 210 a-n with the logical address of the command mapped to a first physical address for storage I/O bus 210 a, and mapped to a second physical address for storage I/O bus 210 b, and so forth as further described below.

Typically, the bus arbiter 420 selects from among the bank controllers 418 and pulls subcommands from output queues within the bank controllers 418 and forwards these to the storage bus controller 348 in a sequence that optimizes the performance of the banks 214. In another embodiment, the bus arbiter 420 may respond to a high level interrupt and modify the normal selection criteria. In another embodiment, the master controller 224 can control the bus arbiter 420 through the control and status registers 340. One of skill in the art will recognize other means by which the bus arbiter 420 may control and interleave the sequence of commands from the bank controllers 418 to the solid-state storage media 110.

The bus arbiter 420 typically coordinates selection of appropriate commands, and corresponding data when required for the command type, from the bank controllers 418 and sends the commands and data to the storage bus controller 348. The bus arbiter 420 typically also sends commands to the storage control bus 212 to select the appropriate bank 214. For the case of flash memory or other solid-state storage media 110 with an asynchronous, bi-directional serial storage I/O bus 210, only one command (control information) or set of data can be transmitted at a time. For example, when write commands or data are being transmitted to the solid-state storage media 110 on the storage I/O bus 210, read commands, data being read, erase commands, management commands, or other status commands cannot be transmitted on the storage I/O bus 210. For example, when data is being read from the storage I/O bus 210, data cannot be written to the solid-state storage media 110.

For example, during a write operation on bank-0 the bus arbiter 420 selects the bank-0 controller 418 a which may have a write command or a series of write sub-commands on the top of its queue which cause the storage bus controller 348 to execute the following sequence. The bus arbiter 420 forwards the write command to the storage bus controller 348, which sets up a write command by selecting bank-0 214 a through the storage control bus 212, sending a command to clear the input buffers of the solid-state storage elements 110 associated with the bank-0 214 a, and sending a command to validate the status of the solid-state storage elements 216, 218, 220 associated with the bank-0 214 a. The storage bus controller 348 then transmits a write subcommand on the storage I/O bus 210, which contains the physical addresses including the address of the logical erase block for each individual physical erase solid-state storage element 216 a-m as mapped from the logical erase block address. The storage bus controller 348 then muxes the write buffer 320 through the write sync buffer 308 to the storage I/O bus 210 through the MUX 350 and streams write data to the appropriate page. When the page is full, the storage bus controller 348 causes the solid-state storage elements 216 a-m associated with the bank-0 214 a to program the input buffer to the memory cells within the solid-state storage elements 216 a-m. Finally, the storage bus controller 348 validates the status to ensure that the page was correctly programmed.

A read operation is similar to the write example above. During a read operation, typically the bus arbiter 420, or other component of the bank interleave controller 344, receives data and corresponding status information and sends the data to the read data pipeline 108 while sending the status information on to the control and status registers 340. Typically, a read data command forwarded from the bus arbiter 420 to the storage bus controller 348 will cause the MUX 350 to gate the read data on storage I/O bus 210 to the read data pipeline 108 and send status information to the appropriate control and status registers 340 through the status MUX 422.

The bus arbiter 420 coordinates the various command types and data access modes so that only an appropriate command type or corresponding data is on the bus at any given time. If the bus arbiter 420 has selected a write command, and write subcommands and corresponding data are being written to the solid-state storage media 110, the bus arbiter 420 will not allow other command types on the storage I/O bus 210. Beneficially, the bus arbiter 420 uses timing information, such as predicted command execution times, along with status information received concerning bank 214 status to coordinate execution of the various commands on the bus with the goal of minimizing or eliminating idle time of the busses.

The master controller 224 through the bus arbiter 420 typically uses expected completion times of the commands stored in the queues 410, 412, 414, 416, along with status information, so that when the subcommands associated with a command are executing on one bank 214 a, other subcommands of other commands are executing on other banks 214 b-n. When one command is fully executed on a bank 214 a, the bus arbiter 420 directs another command to the bank 214 a. The bus arbiter 420 may also coordinate commands stored in the queues 410, 412, 414, 416 with other commands that are not stored in the queues 410, 412, 414, 416.

For example, an erase command may be sent out to erase a group of erase blocks within the solid-state storage media 110. An erase command may take 10 to 1000 times more time to execute than a write or a read command or 10 to 100 times more time to execute than a program command. For N banks 214, the bank interleave controller 344 may split the erase command into N commands, each to erase a virtual erase block of a bank 214 a. While bank-0 214 a is executing an erase command, the bus arbiter 420 may select other commands for execution on the other banks 214 b-n. The bus arbiter 420 may also work with other components, such as the storage bus controller 348, the master controller 224, etc., to coordinate command execution among the buses. Coordinating execution of commands using the bus arbiter 420, bank controllers 418, queues 410, 412, 414, 416, and agents 402, 404, 406, 408 of the bank interleave controller 344 can dramatically increase performance over other solid-state storage systems without a bank interleave function.
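
The following is an illustrative sketch, in C, of splitting one long-running erase into N per-bank commands so that other banks remain free to service other commands. The names NUM_BANKS and dispatch_erase are hypothetical; in the device, each per-bank command would be placed on that bank's erase queue 414 rather than printed.

```c
#include <stdio.h>

#define NUM_BANKS 4

/* Hypothetical dispatch: stands in for enqueueing into a bank's erase queue. */
static void dispatch_erase(unsigned bank, unsigned erase_block)
{
    printf("bank-%u: erase erase block %u\n", bank, erase_block);
}

/* Split the erase of a virtual erase block into one command per bank so the
 * bus arbiter can keep scheduling reads and writes on the banks not erasing. */
static void erase_virtual_erase_block(unsigned veb)
{
    for (unsigned bank = 0; bank < NUM_BANKS; bank++)
        dispatch_erase(bank, veb);
}

int main(void)
{
    erase_virtual_erase_block(7);
    return 0;
}
```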

In one embodiment, the solid-state controller 104 includes one bank interleave controller 344 that serves all of the storage elements 216, 218, 220 of the solid-state storage media 110. In another embodiment, the solid-state controller 104 includes a bank interleave controller 344 for each column of storage elements 216 a-m, 218 a-m, 220 a-m. For example, one bank interleave controller 344 serves one column of storage elements SSS 0.0-SSS M.0 216 a, 216 b, . . . 216 m, a second bank interleave controller 344 serves a second column of storage elements SSS 0.1-SSS M.1 218 a, 218 b, . . . 218 m, etc.

Storage-Specific Components

The solid-state storage controller 104 includes a synchronization buffer 346 that buffers commands and status messages sent and received from the solid-state storage media 110. The synchronization buffer 346 is located at the boundary between the solid-state storage clock domain and the local bus clock domain and provides buffering to account for the clock domain differences. The synchronization buffer 346, write synchronization buffer 308, and read synchronization buffer 328 may be independent or may act together to buffer data, commands, status messages, etc. In one embodiment, the synchronization buffer 346 is located where there are the fewest number of signals crossing the clock domains. One skilled in the art will recognize that synchronization between clock domains may be arbitrarily moved to other locations within the solid-state storage device 102 in order to optimize some aspect of design implementation.

The solid-state storage controller 104 includes a storage bus controller 348 that interprets and translates commands for data sent to and read from the solid-state storage media 110 and status messages received from the solid-state storage media 110 based on the type of solid-state storage media 110. For example, the storage bus controller 348 may have different timing requirements for different types of storage, storage with different performance characteristics, storage from different manufacturers, etc. The storage bus controller 348 also sends control commands to the storage control bus 212.

In one embodiment, the solid-state storage controller 104 includes a MUX 350 that comprises an array of multiplexers 350 a-n where each multiplexer is dedicated to a row in the solid-state storage array 110. For example, multiplexer 350 a is associated with solid-state storage elements 216 a, 218 a, 220 a. MUX 350 routes the data from the write data pipeline 106 and commands from the storage bus controller 348 to the solid-state storage media 110 via the storage I/O bus 210 and routes data and status messages from the solid-state storage media 110 via the storage I/O bus 210 to the read data pipeline 108 and the control and status registers 340 through the storage bus controller 348, synchronization buffer 346, and bank interleave controller 344.

In one embodiment, the solid-state storage controller 104 includes a MUX 350 for each row of solid-state storage elements (e.g. SSS 0.1 216 a, SSS 0.2 218 a, SSS 0.N 220 a). A MUX 350 combines data from the write data pipeline 106 and commands sent to the solid-state storage media 110 via the storage I/O bus 210 and separates data to be processed by the read data pipeline 108 from commands. Packets stored in the write buffer 320 are directed on busses out of the write buffer 320 through a write synchronization buffer 308 for each row of solid-state storage elements (SSS x.0 to SSS x.N 216, 218, 220) to the MUX 350 for each row of solid-state storage elements (SSS x.0 to SSS x.N 216, 218, 220). The commands and read data are received by the MUXes 350 from the storage I/O bus 210. The MUXes 350 also direct status messages to the storage bus controller 348.

The storage bus controller 348 includes a mapping module 424. The mapping module 424 maps a logical address of an erase block to one or more physical addresses of an erase block. For example, a solid-state storage media 110 with an array of twenty storage elements (e.g. SSS 0.0 to SSS M.0 216) per block 214 a may have a logical address for a particular erase block mapped to twenty physical addresses of the erase block, one physical address per storage element. Because the storage elements are accessed in parallel, erase blocks at the same position in each storage element in a row of storage elements 216 a, 218 a, 220 a will share a physical address. To select one erase block (e.g. in storage element SSS 0.0 216 a) instead of all erase blocks in the row (e.g. in storage elements SSS 0.0, 0.1, . . . 0.N 216 a, 218 a, 220 a), one bank (in this case bank-0 214 a) is selected.

This logical-to-physical mapping for erase blocks is beneficial because if one erase block becomes damaged or inaccessible, the mapping can be changed to map to another erase block. This mitigates the loss of an entire virtual erase block when one element's erase block is faulty. The remapping module 430 changes a mapping of a logical address of an erase block to one or more physical addresses of a virtual erase block (spread over the array of storage elements). For example, virtual erase block 1 may be mapped to erase block 1 of storage element SSS 0.0 216 a, to erase block 1 of storage element SSS 1.0 216 b, . . . , and to storage element M.0 216 m, virtual erase block 2 may be mapped to erase block 2 of storage element SSS 0.1 218 a, to erase block 2 of storage element SSS 1.1 218 b, . . . , and to storage element M.1 218 m, etc. Alternatively, virtual erase block 1 may be mapped to one erase block from each storage element in an array such that virtual erase block 1 includes erase block 1 of storage element SSS 0.0 216 a to erase block 1 of storage element SSS 1.0 216 b to storage element M.0 216 m, and erase block 1 of storage element SSS 0.1 218 a to erase block 1 of storage element SSS 1.1 218 b, . . . , and to storage element M.1 218 m, for each storage element in the array up to erase block 1 of storage element M.N 220 m.

If erase block 1 of a storage element SSS 0.0 216 a is damaged, experiencing errors due to wear, etc., or cannot be used for some reason, the remapping module 430 could change the logical-to-physical mapping for the logical address that pointed to erase block 1 of virtual erase block 1. If a spare erase block (call it erase block 221) of storage element SSS 0.0 216 a is available and currently not mapped, the remapping module 430 could change the mapping of virtual erase block 1 to point to erase block 221 of storage element SSS 0.0 216 a, while continuing to point to erase block 1 of storage element SSS 1.0 216 b, erase block 1 of storage element SSS 2.0 (not shown) . . . , and to storage element M.0 216 m. The mapping module 424 or remapping module 430 could map erase blocks in a prescribed order (virtual erase block 1 to erase block 1 of the storage elements, virtual erase block 2 to erase block 2 of the storage elements, etc.) or may map erase blocks of the storage elements 216, 218, 220 in another order based on some other criteria.
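
The following is a minimal sketch, in C, of the kind of per-element erase block table the mapping module 424 and remapping module 430 could maintain: a logical erase block address maps to one physical erase block per storage element, and a single element's entry can be remapped to a spare. The array sizes and the spare block number 221 are taken from the example above purely for illustration.

```c
#include <stdio.h>

#define NUM_ELEMENTS    20   /* storage elements in a row, e.g. SSS 0.0 .. SSS M.0 */
#define NUM_LOGICAL_EBS  4

/* map[logical_eb][element] = physical erase block within that element */
static unsigned map[NUM_LOGICAL_EBS][NUM_ELEMENTS];

/* Prescribed order: virtual erase block n -> erase block n of each element. */
static void map_prescribed_order(void)
{
    for (unsigned leb = 0; leb < NUM_LOGICAL_EBS; leb++)
        for (unsigned el = 0; el < NUM_ELEMENTS; el++)
            map[leb][el] = leb;
}

/* Remap a single element's entry, e.g. when its erase block is worn or damaged. */
static void remap(unsigned leb, unsigned element, unsigned spare_eb)
{
    map[leb][element] = spare_eb;
}

int main(void)
{
    map_prescribed_order();
    remap(1, 0, 221);   /* erase block 1 of element 0 replaced by spare block 221 */
    printf("logical EB 1, element 0 -> physical EB %u\n", map[1][0]);
    printf("logical EB 1, element 1 -> physical EB %u\n", map[1][1]);
    return 0;
}
```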

In one embodiment, the erase blocks could be grouped by access time. Grouping by access time, meaning time to execute a command, such as programming (writing) data into pages of specific erase blocks, can level command completion so that a command executed across the erase blocks of a virtual erase block is not limited by the slowest erase block. In other embodiments, the erase blocks may be grouped by wear level, health, etc. One of skill in the art will recognize other factors to consider when mapping or remapping erase blocks.

In one embodiment, the storage bus controller 348 includes a status capture module 426 that receives status messages from the solid-state storage media 110 and sends the status messages to the status MUX 422. In another embodiment, when the solid-state storage media 110 is flash memory, the storage bus controller 348 includes a NAND bus controller 428. The NAND bus controller 428 directs commands from the read and write data pipelines 106, 108 to the correct location in the solid-state storage media 110, coordinates timing of command execution based on characteristics of the flash memory, etc. If the solid-state storage media 110 is another solid-state storage type, the NAND bus controller 428 would be replaced by a bus controller specific to the storage type. One of skill in the art will recognize other functions of a NAND bus controller 428.

Data Block Usage Information Synchronization

FIG. 5 is a schematic block diagram illustrating a logical representation 500 of a solid-state storage controller 506 in accordance with the present invention. The storage controller 506 may be similar, in certain embodiments, to the solid-state storage controller 104 depicted in FIG. 1 and FIG. 2 and may include one or more solid-state storage controllers 104. The depicted embodiment shows a user application 502 in communication with a storage client 504. The storage client 504 is in communication with a storage controller 506 that includes a logical-to-physical translation layer 512, an ECC correction module 514, a read data pipeline 516, and a write data pipeline 518.

The storage controller 506 directly manages a solid-state storage array 522. In one embodiment, the storage controller 506 directly manages a solid-state storage array 522 by managing and performing operations on the solid-state storage media 110 in the solid-state storage array 522 without any intervening independent hardware and/or software layers or interfaces. In one embodiment, the storage controller 506 directly manages the solid-state storage media 110 by directly performing storage operations on the solid-state storage media 110. A storage controller 506 that directly manages solid-state storage media 110 may include various hardware and software controllers, drivers, and software, such as the depicted hardware controllers 520.

The non-volatile storage media may be embodied as solid-state storage media 110, a single solid-state storage die, a solid-state storage drive, a hard disk drive, a set of hard disk drives, and the like. The non-volatile storage media may be one or more non-volatile storage volumes embodied as one or more block-oriented volumes comprising non-volatile storage media that stores a plurality of data blocks. In one embodiment, the one or more non-volatile storage volumes are flash storage volumes, each including one or more flash storage media. The non-volatile storage media may also be embodied in one or more virtual or logical volumes formed by a physical volume/partition, or a plurality of physical volumes/partitions. The non-volatile storage media may also be embodied in one or more hybrid or hybrid virtual volumes. The non-volatile storage media may reside in a single solid-state storage device or a plurality of solid-state storage devices. The non-volatile storage media may also reside on other block-oriented devices and systems such as a Storage Area Network (“SAN”).

In one embodiment, the depicted hardware controllers 520 may be substantially similar to and include similar functionality as the solid-state controllers 104 and accompanying controllers and modules depicted in FIG. 2 and/or the bank interleave controller 344 and storage bus controller 348 depicted in FIG. 3. Furthermore, the ECC correction module 514 may be substantially similar and include similar functionality to the ECC correction module 322 and/or the ECC generator 304 depicted in FIG. 3. In addition, the read data pipeline 516 and the write data pipeline 518 may be substantially similar to the read data pipeline 108 and the write data pipeline 106 depicted in FIG. 1 and FIG. 3. The solid-state storage array 522 may include an array of solid-state storage banks similar to the solid-state storage media 110 and corresponding solid-state storage banks 214 depicted in FIG. 2.

In one embodiment, the user application 502 is a software application operating on or in conjunction with the storage client 504. The storage client 504 manages files and data and utilizes the functions and features of the storage controller 506 and associated solid-state storage array 522. Representative examples of storage clients include, but are not limited to, a server, a file system, an operating system, a database management system (“DBMS”), a volume manager, and the like. The storage client 504 is in communication with the storage controller 506. In one embodiment, the storage client 504 communicates through an Input/Output (I/O) interface represented by a block I/O emulation layer 508.

Certain conventional block storage devices divide the storage media into volumes or partitions. Each volume or partition may include a plurality of sectors. One or more sectors are organized into a logical block. In certain storage systems, such as those interfacing with the Windows® operating systems, the logical blocks are referred to as clusters. In other storage systems, such as those interfacing with UNIX, Linux, or similar operating systems, the logical blocks are referred to simply as blocks. A logical block or cluster represents a smallest physical amount of storage space on the storage media that is managed by the storage manager. A block storage device may associate n logical blocks available for user data storage across the storage media with a logical block address, numbered from 0 to n. In certain block storage devices, the logical block addresses may range from 0 to n per volume or partition. In conventional block storage devices, a logical block address maps directly to a particular logical block. In conventional block storage devices, each logical block maps to a particular set of physical sectors on the storage media.

However, storage device 102 does not directly or necessarily associate logical block addresses with particular physical blocks. These storage devices 102 may emulate a conventional block storage interface to maintain compatibility with block storage clients 504.

When the storage client 504 communicates through the block I/O emulation layer 508, the storage device 102 appears to the storage client 504 as a conventional block storage device. In one embodiment, the storage controller 506 provides a block I/O emulation layer 508 which serves as a block device interface, or API. In this embodiment, the storage client 504 communicates with the storage device 102 through this block device interface. In one embodiment, the block I/O emulation layer 508 receives commands and logical block addresses from the storage client 504 in accordance with this block device interface. As a result, the block I/O emulation layer 508 provides the storage device 102 compatibility with block storage clients 504.

In one embodiment, a storage client 504 communicates with the storage controller 506 through a direct interface layer 510. In this embodiment, the storage device 102 directly exchanges information specific to non-volatile storage devices. A storage device 102 using direct interface 510 may store data on the solid-state storage media 110 as blocks, sectors, pages, logical blocks, logical pages, erase blocks, logical erase blocks, ECC chunks, logical ECC chunks, or in any other format or structure advantageous to the technical characteristics of the solid-state storage media 110. The storage controller 506 receives a logical address and a command from the storage client 504 and performs the corresponding operation in relation to the non-volatile solid-state storage media 110. The storage controller 506 may support a block I/O emulation layer 508, a direct interface 510, or both a block I/O emulation layer 508 and a direct interface 510.

As described above, certain storage devices, while appearing to a storage client 504 to be a block storage device, do not directly associate particular logical block addresses with particular physical blocks, also referred to in the art as sectors. Such storage devices may use a logical-to-physical translation layer 512. The logical-to-physical translation layer 512 provides a level of abstraction between the logical block addresses used by the storage client 504, and the physical block addresses at which the storage controller 506 stores the data. The logical-to-physical translation layer 512 maps logical block addresses to physical block addresses of data stored on solid-state storage media 110. This mapping allows data to be referenced in a logical address space using logical identifiers, such as a logical block address. A logical identifier does not indicate the physical location of data on the solid-state storage media 110, but is an abstract reference to the data.

The storage controller 506 manages the physical block addresses in the physical address space. In one example, contiguous logical block addresses may in fact be stored in non-contiguous physical block addresses as the logical-to-physical translation layer 512 determines the location on the solid-state storage media 110 to perform data operations.

Furthermore, in one embodiment, the logical address space is substantially larger than the physical address space. This “thinly provisioned” embodiment allows the number of logical identifiers for data references to greatly exceed the number of possible physical addresses.

In one embodiment, the logical-to-physical translation layer 512 includes a map or index that maps logical block addresses to physical block addresses. The map may be in the form of a b-tree, a content addressable memory (“CAM”), a binary tree, and/or a hash table, and the like. In certain embodiments, the logical-to-physical translation layer 512 is a tree with nodes that represent logical block addresses and comprise corresponding physical block addresses.
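
The following is a minimal sketch, in C, of a logical-to-physical translation table kept as a flat array keyed by logical block address; the layer described above could equally use a b-tree, CAM, binary tree, or hash table. The table size, the UNMAPPED sentinel, and the simple append-only allocator are illustrative assumptions. It also demonstrates the point above that contiguous logical block addresses need not map to contiguous physical addresses.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_LBAS 1024u
#define UNMAPPED UINT64_MAX

static uint64_t l2p[NUM_LBAS];      /* logical block address -> physical block address */
static uint64_t next_phys = 0;      /* append point of a hypothetical log-structured writer */

static void l2p_init(void)
{
    for (unsigned i = 0; i < NUM_LBAS; i++)
        l2p[i] = UNMAPPED;          /* no physical block bound yet */
}

/* Each write binds the LBA to the next physical block in the log. */
static void l2p_write(uint32_t lba)
{
    l2p[lba] = next_phys++;
}

int main(void)
{
    l2p_init();
    l2p_write(100);
    l2p_write(7);
    l2p_write(101);                 /* contiguous LBAs 100 and 101 land on PBAs 0 and 2 */
    printf("LBA 100 -> PBA %llu\n", (unsigned long long)l2p[100]);
    printf("LBA 101 -> PBA %llu\n", (unsigned long long)l2p[101]);
    return 0;
}
```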

As stated above, in conventional block storage devices, a logical block address maps directly to a particular physical block. When a storage client 504 communicating with the conventional block storage device deletes data for a particular logical block address, the storage client 504 may note that the particular logical block address is deleted and can re-use the physical block associated with that deleted logical block address without the need to perform any other action.

Conversely, when a storage client 504, communicating with a storage controller 104 with a logical-to-physical translation layer 512 (a storage controller 104 that does not map a logical block address directly to a particular physical block), deletes a logical block address, the corresponding physical block address remains allocated because the storage client 504 does not communicate the change in used blocks to the storage controller 506. The storage client 504 may not be configured to communicate changes in used blocks (also referred to herein as “data block usage information”). Because the storage client 504 uses the block I/O emulation layer 508, the storage client 504 may erroneously believe that the storage controller 506 is a conventional storage controller that would not utilize the data block usage information. Or, in certain embodiments, other software layers between the storage client 504 and the storage controller 506 may fail to pass on data block usage information.

Consequently, the storage controller 104 preserves the relationship between the logical block address and a physical address and the data on the storage device 102 corresponding to the physical block. As the number of allocated blocks increases, the performance of the storage controller 104 may suffer depending on the configuration of the storage controller 104.

Specifically, in certain embodiments, the storage controller 506 is configured to store data sequentially, using an append-only writing process, and use a storage space recovery process that re-uses non-volatile storage media storing deallocated/unused logical blocks. Specifically, as described above, the storage controller 506 may sequentially write data on the solid-state storage media 110 in a log structured format and within one or more physical structures of the storage elements, the data is sequentially stored on the solid-state storage media 110.

As a result of storing data sequentially and using an append-only writing process, the storage controller 506 achieves a high write throughput and a high number of I/O operations per second (IOPS). The storage controller 506 includes a storage space recovery, or garbage collection, process that re-uses data storage cells to provide sufficient storage capacity. The storage space recovery process reuses storage cells for logical blocks marked as deallocated, invalid, unused, or otherwise designated as available for storage space recovery in the logical-to-physical translation layer 512.

As described above, the storage space recovery process determines that a particular section of storage may be recovered. Once a section of storage has been marked for recovery, the storage controller 506 may relocate valid blocks in the section. The storage space recovery process, when relocating valid blocks, copies the packets and writes them to another location so that the particular section of storage may be reused as available storage space, typically after an erase operation on the particular section. The storage controller 506 may then use the available storage space to continue sequentially writing data in an append-only fashion. Consequently, the storage controller 104 expends resources and overhead in preserving data in valid blocks. Therefore, physical blocks corresponding to deleted logical blocks may be unnecessarily preserved by the storage controller 104, which expends unnecessary resources in relocating the physical blocks during storage space recovery.
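
The following is an illustrative sketch, in C, of the storage space recovery idea in this passage: valid blocks in a recovered section are copied forward before the section is erased for reuse, so every block the controller still believes to be valid adds relocation cost. The structures, section size, and sample data are hypothetical, not the controller's actual implementation.

```c
#include <stdio.h>
#include <stdbool.h>

#define SECTION_BLOCKS 8

struct block { bool valid; int data; };

static struct block section[SECTION_BLOCKS] = {
    {true, 1}, {false, 0}, {true, 3}, {false, 0},
    {false, 0}, {true, 6}, {false, 0}, {false, 0},
};

/* Relocate valid blocks, then treat the section as available storage space. */
static void recover_section(void)
{
    int relocated = 0;
    for (int i = 0; i < SECTION_BLOCKS; i++) {
        if (section[i].valid) {
            /* In the device this would be a copy to the current append point. */
            printf("relocating valid block %d (data=%d)\n", i, section[i].data);
            relocated++;
        }
        section[i].valid = false;   /* section can now be erased and reused */
    }
    printf("%d valid blocks relocated; fewer valid blocks makes recovery cheaper\n",
           relocated);
}

int main(void)
{
    recover_section();
    return 0;
}
```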

Some storage devices 102 are configured to receive messages or commands notifying the storage device 102 of these unused logical blocks so that the storage device 102 may deallocate the corresponding physical blocks. As used herein, to deallocate a physical block includes marking the physical block as invalid, unused, or otherwise designating the physical block as available for storage space recovery, its contents on storage media no longer needing to be preserved by the storage controller 506. Data block usage information, in reference to the storage controller 506, may also refer to information maintained by the storage controller 506 regarding which physical blocks are allocated and/or deallocated/unallocated and changes in the allocation of physical blocks and/or logical-to-physical block mapping information. Data block usage information, in reference to the storage controller 506, may also refer to information maintained by the storage controller 506 regarding which blocks are in use and which blocks are not in use by a storage client. Use of a block may include storing of data in the block on behalf of the client, reserving the block for use by a client, and the like.

While physical blocks may be deallocated, in certain embodiments, the storage controller 506 may not immediately erase the data on the storage media. An erase operation may be performed later in time. In certain embodiments, the data in a deallocated physical block may be marked as unavailable by the storage controller 506 such that subsequent requests for data in the physical block return a null result or an empty set of data.

One example of a command or message for such deallocation is the “Trim” function of the “Data Set Management” command under the T13 technical committee command set specification. A storage device, upon receiving a Trim command, may deallocate physical blocks for logical blocks whose data is no longer needed by the storage client 504. A storage controller 506 that deallocates physical blocks may achieve better performance and increased storage space, especially storage controllers 506 that write data using certain processes and/or use a similar data storage recovery process as that described above.

Consequently, the performance of the storage controller 506 is enhanced as physical blocks are deallocated when they are no longer needed such as through the Trim command or other similar deallocation commands issued to the storage controller 506. However, certain storage clients 504 such as operating systems or other software layers between the storage controller 506 and the user application 502 are not designed to issue or forward on these commands. For example, a storage client 504 may issue a deallocation command that never reaches the storage controller 104 due to the failure of a software layer to forward the command. Additionally, many storage clients 504 that have the ability to issue deallocation commands do so insufficiently or lack the ability to issue commands for certain storage configurations. For example, in event-driven configurations that issue deallocation commands in response to changes to block usage, when a deallocation command is dropped or lost (such as when a storage device is improperly shut down), the opportunity for the blocks corresponding to the dropped command to be trimmed has already passed until new changes are made which would allow them to be reevaluated as a trim candidate. Furthermore, many storage clients 504 cannot issue deallocation commands for a live storage volume that is actively servicing storage requests due to active storage operations continually modifying the physical blocks and/or a block mapping index such as the logical-to-physical translation layer 512.

A storage controller 506 whose performance is enhanced by deallocation commands, but that never receives deallocation commands, may suffer decreased performance as the actions of the storage client 504 unsynchronize its unused logical blocks with the physical blocks of the storage controller 506. Therefore, as depicted in FIG. 5, embodiments of the present invention provide an alternate path 524 for communicating data block usage information from the storage client 504 to the storage controller 506. Those of skill in the art recognize that variations on the embodiments presented herein as examples also come within the scope and intent of the present invention as set forth in the claims. The present invention communicates the data block usage information such that the storage controller 506 can use the data block usage information to operate more efficiently. In one embodiment, the storage controller 506 uses the data block usage information to synchronize the mapping of logical block addresses in the logical-to-physical layer 512 to the mapping maintained by the storage client 504. In another embodiment, the storage controller 506 combines the data block usage information with other metadata in order to more efficiently manage the solid-state storage array 522.

FIG. 6 is a schematic block diagram illustrating one embodiment of a system 600 for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention. The system 600 includes software operating in user mode, the software including utilities 602 with a block usage utility 606. The block usage utility 606 may include a block map 604. As is known in the art, software code in user mode is the non-kernel code in which applications and utilities operate with controlled access to system resources. Programs running in user mode typically cannot access the memory of other programs directly; instead, the programs must use API function calls.

The system 600, in certain embodiments, also includes software operating in kernel mode, the software including a storage client 607 with a storage manager 608, and a storage controller 616 that includes a block usage synchronizer 610 with an in-flight block map 612 and a combined block map 614. The storage controller 616 also includes a control interface 618 and a hardware interface manager 620. As is known in the art, code in kernel mode has full access to system resources and runs the kernel and certain device drivers. Kernel mode memory is typically protected from applications running in user mode.

Furthermore, the system 600 also includes a hardware interface 622 to solid-state storage bank controllers 624 operating an interface 626 to solid-state storage banks 630 in a solid-state storage array 628. The solid-state storage array 628 supports read, write or program, and erase operations and may include an array of solid-state storage banks 628 similar to the solid-state storage array 522 depicted in FIG. 5 and the solid-state storage media 110 and corresponding solid-state storage banks 214 depicted in FIG. 2.

The solid-state storage bank controllers 624 may comprise solid-state storage controller firmware and may be similar to and embodied by the solid-state controllers 104 depicted in FIG. 2 and FIG. 3 and similar controllers and hardware depicted in FIG. 2, FIG. 3, and FIG. 4. The hardware interface manager 620 and the hardware interface 622 cooperate to provide DMA data transfers, command queueing, command completion queueing, interrupts, ECC correction, “append only” write functionality, and other functionality similar to that provided by the storage controller 506 of FIG. 5.

The storage controller 616 may also be similar to the storage controller 506 depicted in FIG. 5. Specifically, the storage controller 616, in certain embodiments, registers with the host as a conventional block device driver with the associated device by providing block device emulation, implements a log-structured storage system, maintains the logical-to-physical map, implements storage space recovery, and other functionality similar to that provided by the storage controller 506 of FIG. 5. The storage controller 616 may also include all or a portion of the hardware interface manager 620, the hardware interface 622, and the solid-state storage bank controllers 624.

The storage manager 608 manages the allocation of storage space for data structures that are stored or will be stored in the future on one or more storage devices, including a storage device such as storage device 102. The storage manager 608 determines which logical blocks are in use, which logical blocks are unused, which logical blocks are reserved, and which logical blocks have changed state between used, unused, and reserved to a different state.

Typically, the storage manager 608 associates logical block addresses with files, directories, and/or other storage data structures, such as, but not limited to, objects or other data structures that are stored or will be stored in the future on the non-volatile storage media such as the solid-state storage media 110 discussed above. The storage manager 608 may include, interface with, or be included as part of a file system, DBMS, volume manager, or portions of a storage client 607 or operating system that manage files, objects, and other data structures that require storage capacity to be allocated for non-volatile storage of the data structure. The storage manager 608 may maintain one or more logical address to logical address mappings and/or one or more logical address to physical address mappings for the storage data structures. In the depicted embodiment, the storage manager 608 resides in kernel mode and interfaces with applications operating in user mode. However, in certain embodiments, the storage manager 608 or portions thereof may reside in user mode.

The storage manager 608 maintains, stores, records, provides and/or manages data block usage information for logical blocks that are managed by the storage manager 608. Data block usage information includes information regarding which one or more logical blocks are allocated/used and/or which logical blocks are unallocated/unused.

As used herein, a logical block is allocated when it is considered a valid block, when the logical block stores content corresponding to existing data of a file or other data structure, when the logical block is unavailable for storing other content, when the logical block reserves storage capacity on behalf of one or more storage clients 607, and the like. Likewise, a logical block is unallocated when it is considered an invalid block, when the logical block does not store content corresponding to existing data of a file or other data structure, when the logical block is available for storing other content, when the logical block does not reserve storage capacity on behalf of one or more storage clients 607, and the like.

Data block usage information, in one embodiment, includes free blocks and used blocks. Free blocks are blocks that are unallocated blocks. Unallocated blocks include blocks that were previously allocated and have now been freed as well as blocks that have not yet been allocated. The data block usage information may also include the identity of blocks currently allocated. Those of skill in the art recognize that given the number of blocks for a volume and the block sequencing, free blocks can readily be derived from used blocks and vice versa.

Data block usage information may be in the form of metadata. In certain embodiments, the data block usage information maintained by the storage manager 608 is accessible (retrievable and/or referenceable) by utilities 602 or applications 502 separate from the storage manager 608.

These utilities 602, which interface with the storage controller 616 by way of the control interface 618, provide certain management, maintenance, optimization, configuration, and tuning functionality for storage devices coupled to or in communication with a host system. The utilities 602 may include defragmentation utilities, volume reconfiguration utilities, disk performance utilities, and the like.

The utilities 602 may interface with the storage manager 608 to obtain data about a file system, disk, or volume. As stated above, the utilities 602 may read, access, obtain, or otherwise reference the data block usage information maintained by the storage manager 608. Specifically, in one embodiment, the utilities 602 reference the data block usage information by way of a storage Application Programming Interface (“API”) of the storage manager 608, through, for example, a function call.

In one embodiment, the storage API is a pre-existing API provided by the storage manager 608 that describes data block usage information for a completely different purpose, and in particular data block usage information for block devices. In one embodiment, the API is a public API for block storage maintenance utilities. In a further embodiment, the storage API is configured for storage media such as a hard disk drive which is a different storage media technology than solid-state storage media 110. In one embodiment, the API is a defragmentation API that block storage maintenance utilities use to defragment hard disk drive volumes.

In one embodiment, the utilities 602 include a block usage utility 606. The block usage utility 606 interacts with the storage manager 608 and communicates data block usage information from the storage manager 608 to the storage controller 616 such that the storage controller 616 can use the data block usage information to operate more efficiently. The block usage utility 606, in the depicted embodiment, operates in user mode as a utility. In other embodiments, all or a portion of the block usage utility 606 operates in kernel mode.

The block usage utility 606 facilitates access to the data block usage information that is managed by the storage manager 608. The block usage utility 606 may directly interface with the storage manager 608 to reference, retrieve, copy, access, and/or obtain a pointer to the data block usage information. Alternatively, the block usage utility 606 cooperates with the storage client 607 to obtain or reference the data block usage information.

The block usage utility 606 provides the data block usage information to the storage controller 616. The storage controller 616 utilizes the data block usage information to operate more efficiently. For example, in one embodiment, the storage controller 616 may use the data block usage information to synchronize a mapping of logical block addresses in the logical-to-physical layer 512 to a mapping maintained by the storage manager 608 and/or storage client 607. Of course, the storage controller 616 may use the data block usage information in other ways as well to improve operation of the storage device 102.

As described above, the data block usage information may include allocated blocks or unallocated blocks. The block usage utility 606 may determine the identity of allocated blocks by using the identity of unallocated blocks and the block usage utility 606 may determine the identity of unallocated blocks by using the identity of allocated blocks. For example, using the volume size and the allocated block information, the block usage utility 606 may determine which blocks are unallocated blocks.
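
The following is a minimal sketch, in C, of the derivation described in this paragraph: given the volume size and the identities of allocated blocks, the unallocated blocks are simply the complement. The volume size and sample block identities are illustrative.

```c
#include <stdio.h>
#include <stdbool.h>

#define VOLUME_BLOCKS 16

int main(void)
{
    bool allocated[VOLUME_BLOCKS] = { false };
    unsigned used[] = { 0, 3, 4, 9 };            /* identities of allocated blocks */

    for (unsigned i = 0; i < sizeof used / sizeof used[0]; i++)
        allocated[used[i]] = true;

    printf("unallocated blocks:");
    for (unsigned b = 0; b < VOLUME_BLOCKS; b++)
        if (!allocated[b])
            printf(" %u", b);                    /* complement of the used set */
    printf("\n");
    return 0;
}
```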

In one embodiment, the block usage utility 606 references the data block usage information directly, such as in a shared memory structure. In a further embodiment, the block usage utility 606 references the data block usage information, in user mode, through an API of the storage client 607 and/or the storage manager 608. The block usage utility 606 may operate as an application or service in user mode or the equivalent functionality may be embedded directly into other modules such as the storage controller 616. In other words, the storage controller 616 may reference block usage information via a block usage utility 606 or directly from the storage client 607 or the storage manager 608. This may be done, in one embodiment, by mapping the block usage information user level API into kernel space and calling the user level API directly from the storage controller 616, or by some other similar mechanism.

Those of skill in the art recognize that the data block usage information may be represented and communicated in many forms. For example, in response to the function call, the storage client 607 and/or the storage manager 608 may return a data structure or identifier for a data structure that provides the data block usage information. The data structure storing the data block usage information may include a list, file, object, table, bit map, and the like. One skilled in the art realizes that the data block usage information is not restricted to any particular data structure, but may be embodied as one or more data structures known in the art. Furthermore, the data block usage information may represent used/unused block information for a single logical block, a set of logical blocks, the logical blocks for a particular volume, logical blocks for a set of volumes, and the like.

In one embodiment, a data structure returned by the API function call is a block map 604. The API function call serves as an interface between the storage manager 608 and the block usage utility 606. The block map 604 is a bit map with each bit representing an allocable unit and the binary value for the bit representing whether the allocable unit is an allocated block or an unallocated block. An allocable unit may include a block, one or more blocks, a cluster, or the like. The block map 604 may represent every allocable unit of a volume, a subset of allocable units of a volume, or allocable units corresponding to a particular set of units for a volume such as a set of logical block addresses. For example, the block usage utility 606 may execute a function call to the storage API requesting a block map for ten logical blocks associated with a set of ten logical block addresses. The API may return a 10×1 block bit map indicating an active bit for the logical blocks that are allocated.
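
The following is a minimal sketch, in C, of reading such a bit map: one bit per allocable unit, with a set bit meaning the unit is allocated. The ten-block request mirrors the example above; packing the map into a single word and the particular bit pattern are illustrative assumptions.

```c
#include <stdio.h>
#include <stdint.h>

/* Bit n of the map covers allocable unit n of the requested range. */
static int block_is_allocated(uint16_t block_map, unsigned index)
{
    return (block_map >> index) & 1u;
}

int main(void)
{
    /* Hypothetical reply covering ten logical block addresses:
     * offsets 0, 1, 5, and 9 of the requested range are allocated. */
    uint16_t block_map = 0x0223;    /* 0b10_0010_0011 */

    for (unsigned i = 0; i < 10; i++)
        printf("LBA offset %u: %s\n", i,
               block_is_allocated(block_map, i) ? "allocated" : "unallocated");
    return 0;
}
```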

In one embodiment, the block usage utility 606 requests, references, or executes a function call for a block map 604 for all logical blocks managed by the storage manager 608. In certain embodiments, the block usage utility 606 requests, references, or executes a function call for a block map 604 for a set or group of logical blocks. For example, the storage manager 608 may provide, through the storage API, a block map 604 defining data block usage information for a storage volume, a group of storage volumes, or for a set of logical blocks in a storage volume. In one embodiment, the storage API receives a contiguous range of logical blocks and returns a block map 604 indicating block usage for that range of logical blocks.

In certain embodiments, the block usage utility 606 calls a block usage function of the storage API designed and intended for use in defragmenting a block-oriented storage device. Instead, the block usage utility 606 uses the same storage API function calls for communicating deallocation messages and/or storage block allocation synchronization within the storage controller 616.

In one embodiment, the storage API is a defragmentation API for block-oriented storage devices. For example, certain utilities 602 may reference data block usage information from the defragmentation API in order to execute block defragmentation operations. Advantageously, this data block usage information is used by the present invention to facilitate block usage synchronization. Re-purposing this defragmentation API for communicating data block usage information to the storage controller 616 enables the present invention to operate in existing storage architectures that provide a defragmentation API but do not support communication of data block usage information for improving operation of storage devices such as storage device 102 that can use the data block usage information for more efficient operation.

The block usage utility 606, operating in user mode, may reference the block map 604 for a set of logical blocks of a volume and communicate the data block usage information from the block map 604 to the storage controller 616. The block usage utility 606 may identify the set of logical blocks by providing the storage API a set of clusters or blocks for a partition/volume. Alternatively, the block usage utility 606 may use other methods to identify the set of logical blocks and/or logical block addresses.

In one embodiment, with the block usage information of the block map 604, the block usage utility 606 sends a Trim command or other deallocation command for the unused blocks. The block usage utility 606 may send the Trim command in response to the storage controller 616 supporting the Trim command. The block usage utility 606 may iterate through the logical block addresses of a volume, selecting a set of logical block addresses to evaluate, and sending messages to the storage controller 616 identifying unused blocks from each set of logical blocks.
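
The following is an illustrative sketch, in C, of the iteration just described: walk a volume's logical block addresses in fixed-size sets, look up each set's block map, and send a deallocation message for each unused block. The functions get_block_map() and send_deallocation() are hypothetical placeholders for the storage API call and for the Trim-style message to the storage controller; the volume and set sizes are likewise illustrative.

```c
#include <stdio.h>
#include <stdint.h>

#define VOLUME_LBAS 64u
#define SET_SIZE    16u

/* Hypothetical block map lookup: returns a bit map, 1 = allocated. */
static uint32_t get_block_map(uint32_t first_lba, uint32_t count)
{
    (void)first_lba;
    return (count >= 3) ? 0x7u : 0x0u;      /* pretend only the first 3 of each set are used */
}

/* Stands in for a Trim command or other deallocation message. */
static void send_deallocation(uint32_t lba)
{
    printf("deallocate LBA %u\n", lba);
}

int main(void)
{
    for (uint32_t base = 0; base < VOLUME_LBAS; base += SET_SIZE) {
        uint32_t map = get_block_map(base, SET_SIZE);
        for (uint32_t i = 0; i < SET_SIZE; i++)
            if (!((map >> i) & 1u))
                send_deallocation(base + i);
    }
    return 0;
}
```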

Advantageously, the block usage utility 606 may issue deallocation commands using data block usage information obtained directly from the storage manager 608. The storage controller 616 does not need to rely on deallocation commands or notifications issued by other storage software layers. Similarly, the block usage utility 606 may also complement the deallocation methodologies of storage clients 607.

In another embodiment, the block usage utility 606 initiates a block usage synchronizer 610, which is described in greater detail below, to synchronize the data block usage information of the storage controller 616 with the data block usage information of the storage manager 608. In one embodiment, the block usage utility 606 initiates the block usage synchronizer 610 by way of issuing the Trim command or by simply making a function call.

The block usage utility 606 may initiate the block usage synchronizer 610 in response to one or more predetermined events or at a predetermined time interval. In certain embodiments, the block usage utility 606 operates in a manner that minimizes the workload on the storage controller 616 and/or computer 112 resources. In addition, the block usage utility 606 may operate such that the synchronization operations of the block usage synchronizer 610 impose a minimal workload on the storage controller 616 and/or computer 112 resources. In certain embodiments, the block usage utility 606 minimizes the workload by passing a reference to the block map 604 to the block usage synchronizer 610 operating in kernel mode rather than passing a copy of the block map 604.

The block usage synchronizer 610 facilitates synchronization of the storage manager's 608 data block usage information and the storage controller's 616 data block usage information, which includes, in one embodiment, the physical block allocation mappings managed by the storage controller 616. Therefore, in one embodiment, the block usage synchronizer 610 facilitates synchronization between the physical block allocation mappings managed by the storage controller 616 and the logical block allocation mappings managed by the storage manager 608. In one embodiment, the block usage synchronizer 610 uses the data block usage information to synchronize the mapping of logical block addresses in the logical-to-physical layer 512 (See FIG. 5) to the mapping maintained by the storage client 607.

In the depicted embodiment, the block usage synchronizer 610 executes in kernel mode. In other embodiments, a portion of the block usage synchronizer 610 may execute in user mode. In the depicted embodiment, the block usage synchronizer executes inside of the storage controller 616. However, in alternate embodiments, the block usage synchronizer 610 may execute outside the storage controller 616.

The block usage synchronizer 610 accesses the data block usage information directly by way of the storage API or through the block usage utility 606. In one embodiment, the block usage synchronizer 610 accesses or receives data block usage information from the block usage utility 606 operating in user mode. In another embodiment, the block usage synchronizer 610 calls a function of, references, and/or accesses the data block usage information directly from within kernel mode. For example, the block usage synchronizer 610 may call the storage API directly from kernel mode to reference the block map 604.

Advantageously, the block usage synchronizer 610 may deallocate unused physical blocks and synchronize data block usage information when the storage controller 616 communicates with storage clients 607 that do not issue deallocation commands and without reliance on deallocation commands that may not reach the storage controller 616. Similarly, the block usage synchronizer 610 may also complement the deallocation methods of storage clients 607, operating systems, or file servers to provide more efficient block usage synchronization or to ensure full coverage of the block space.

In certain embodiments, the storage controller 616 manages a live volume actively servicing storage requests. To keep data block usage information current with storage operations on the live volume, the block usage synchronizer 610 may combine the data block usage information with other metadata reflecting added potential changes to data block usage information. In one embodiment, the block usage synchronizer 610 monitors certain storage operations after the block map 604 is referenced. The block usage synchronizer 610 may manage, provide, and/or implement information about the block usage of “in-flight” storage operations not included in the data block usage information. As used herein, “in-flight” storage operations are storage operations whose data block usage information is not included in the data block usage information due to the timing of the storage operations. In-flight operations may include storage operations that modify a logical block and are executed by the storage controller 616 subsequent to, subsequent in time to, or after, the moment in time when the block usage synchronizer 610 and/or the block usage utility 606 references the data block usage information. Similarly, these in-flight storage operations may be executed by the storage controller 616 prior to the moment in time when the block usage synchronizer 610 or storage controller 616 deallocates the unused blocks based on the data block usage information or when the block usage utility 606 communicates the data block usage information to the storage controller 616.

As with the data block usage information, the in-flight information may be represented, stored, and/or communicated in many forms such as a data structure or identifier for a data structure. The data structure storing the in-flight information may include a list, file, object, table, or the like. One skilled in the art realizes that the in-flight information is not restricted to any particular data structure, but may be embodied as one or more data structures known in the art. Furthermore, the in-flight information may represent storage operations, the blocks modified by storage operations, or the like. Examples of the storage operations may include writing data or reserving storage space in certain previously unused logical blocks. In one embodiment, the in-flight information indicates changes to the current set of logical blocks analyzed by the block usage synchronizer 610 and corresponding to the logical blocks represented by the data block usage information.

In one embodiment, the block usage synchronizer 610 maintains the in-flight information as a block map. The block usage synchronizer 610 may use this in-flight block map 612 to update the data block usage information referenced through the storage manager 608. The block usage synchronizer 610 may modify the data block usage information in the in-flight block map 612 for certain storage operations that change unused blocks represented in the block map 604 to used blocks.

For example, if a storage operation in a FIFO command queue is not yet executed when the block usage synchronizer 610 references the block map 604, the block map 604 may become inaccurate because the storage operation may execute before a Trim command issued by the block usage synchronizer 610. To account for in-flight block usage changes, the block usage synchronizer 610 cooperates with the storage controller 616 to maintain the block usage information in the in-flight block map 612. The in-flight block map 612 may be used to update the data block usage information of the block map 604.

In one embodiment, the block usage synchronizer 610 combines the block map 604 and the in-flight block map 612 to produce a combined block map 614 used to identify unused blocks. In certain embodiments, the combined block map 614 is a separate data structure such as a separate bit map of the same size as the block map 604 and in-flight block map 612. Alternatively, the block usage synchronizer 610 merges the in-flight block map 612 into the block map 604 by way of an operation such as a binary OR operation. In such an embodiment, the block map 604 becomes the combined block map 614 instead of using a separate data structure.
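
By way of illustration and not limitation, the following Python sketch shows one way the block map 604 and the in-flight block map 612 could be merged with a binary OR. The bit convention (a set bit denoting a used block) and the function names are assumptions introduced here for clarity rather than elements of any particular embodiment.

    # Illustrative sketch: merge a block map with an in-flight block map.
    # Assumption: a set bit means the corresponding logical block is used.
    def combine_block_maps(block_map: int, in_flight_map: int) -> int:
        """Binary OR keeps a block marked as used if either map marks it used."""
        return block_map | in_flight_map

    def unused_blocks(combined_map: int, num_blocks: int):
        """Yield logical block indexes whose combined bit is clear (unused)."""
        for lba in range(num_blocks):
            if not (combined_map >> lba) & 1:
                yield lba

    # Blocks 0 and 3 are used per the block map; an in-flight write touches
    # block 2, so only blocks 1, 4, and 5 remain candidates for deallocation.
    combined = combine_block_maps(0b1001, 0b0100)
    print(list(unused_blocks(combined, 6)))  # [1, 4, 5]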

By monitoring in-flight data operations with the in-flight block map 612 and building the combined block map 614, the block usage synchronizer 610 has the most current block usage information for performing the block usage synchronization. In addition, the block usage information accurately represents unused blocks as identified by the storage manager 608. In one embodiment, the block usage utility 606 may detect or determine storage operations and thereby maintain, manage, and/or store the in-flight block map 612 and/or the combined block map 614.

FIG. 7 is a schematic block diagram illustrating one embodiment of a system 700 for data block usage information synchronization for a non-volatile storage volume using a RAID controller in accordance with the present invention. FIG. 7 includes a block usage utility 606, a block map 604, and a storage manager 608, which may be similar to the block usage utility 606, the block map 604, and the storage manager 608 of FIG. 6. FIG. 7 also includes a RAID storage controller 702 managing a plurality of sub-controllers 705 a-n in a RAID configuration 704. Each sub-controller 705 performs storage operations and/or stores data on one or more solid-state storage devices 706 through a hardware interface 622 similar to the hardware interface 622 depicted in FIG. 6.

In one embodiment, the sub-controllers 705 include functionality and features similar to the storage controller 616 described above. However, the sub-controllers 705 may be configured to operate with the RAID storage controller 702. Furthermore, in one embodiment, each sub-controller 705 is configured to manage and operate a single solid-state storage device 706. Alternatively, or in addition, a sub-controller 705 may manage and operate a plurality of solid-state storage devices 706. For example, a single sub-controller 705 may operate two or more solid-state storage devices 706 in a RAID configuration such as RAID 0, 1, or 5, and/or cooperate with the RAID storage controller 702 to implement a composite RAID configuration such as RAID 10 or RAID 01.

Although a RAID storage controller 702 managing a plurality of sub-controllers 705 is depicted in FIG. 7, one of ordinary skill in the art realizes that a single RAID storage controller 702 may also manage a plurality of storage devices 706 in a RAID configuration 704 without using sub-controllers 705. Furthermore, the depicted RAID configuration 704 may comprise a RAID 0, RAID 1, RAID 10 (1+0), or RAID 5 configuration. In addition, the RAID storage controller 702 and sub-controllers 705 may be implemented in hardware, software, or a combination of hardware and software.

In certain RAID configurations 704 (e.g., RAID 1), storage devices 706 may store identical data blocks as other storage devices 706 in the RAID array (such as a mirror storage device). For example, storage device 706 b may mirror storage device 706 a and storage device 706 d may mirror storage device 706 c. In other RAID configurations 704, each storage device 706 in the RAID array may store different data blocks than other storage devices in the RAID array, such as in a RAID 0, 3, 4, or 5 configuration in which data is striped across storage devices 706. Consequently, certain portions of the data block usage information may pertain to certain storage devices 706. For example, data may be striped across storage devices 706 a, 706 b, 706 c, and 706 d with a stride stored on each storage device 706.

Therefore, when data is stored in a RAID configuration 704, the block usage utility 606 ensures that data block usage information is communicated to each storage sub-controller 705 of the RAID in accordance with the RAID configuration 704. For example, for a RAID 1 configuration, the block usage utility 606 communicates the data block usage information to each storage sub-controller 705 participating in the mirroring configuration. Similarly, for a RAID 0 configuration, the block usage utility 606 communicates the applicable portion of data block usage information to each applicable storage sub-controller 705 participating in the stripe configuration. Furthermore, in other embodiments, the block usage synchronizer 610 (see FIG. 6) may ensure that data block usage information is synchronized for each storage device 706 of the RAID array (including mirror storage devices) and that the data block usage information for each storage device 706 is synchronized with its corresponding portion of the data block usage information from the storage manager 608 when data is striped.
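
By way of a non-limiting example, the Python sketch below illustrates how unused block information might be routed according to a RAID configuration: a mirrored (RAID 1) layout receives the full list on every sub-controller, while a striped (RAID 0) layout receives only the applicable portion. The round-robin stride layout, controller names, and function names are assumptions made only for the example.

    # Illustrative routing of unused-block information by RAID configuration.
    def route_unused_blocks_raid1(unused_lbas, mirror_controllers):
        """RAID 1: every mirror stores the same blocks, so each sub-controller
        receives the full list of unused logical blocks."""
        return {ctrl: list(unused_lbas) for ctrl in mirror_controllers}

    def route_unused_blocks_raid0(unused_lbas, stripe_controllers, stride_blocks):
        """RAID 0: data is striped, so each sub-controller receives only the
        unused blocks on its own strides (simple round-robin layout assumed)."""
        routed = {ctrl: [] for ctrl in stripe_controllers}
        for lba in unused_lbas:
            owner = (lba // stride_blocks) % len(stripe_controllers)
            routed[stripe_controllers[owner]].append(lba)
        return routed

    # Four unused blocks striped across two devices, two blocks per stride.
    print(route_unused_blocks_raid0([0, 1, 2, 5], ["sub_a", "sub_b"], 2))
    # {'sub_a': [0, 1, 5], 'sub_b': [2]}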

In one embodiment, the RAID storage controller 702 is configured to pass along unused block information to the appropriate sub-controllers 705 and/or storage devices 706. In this embodiment, the block usage utility 606 communicates data block usage information to the RAID storage controller 702. The RAID storage controller 702 may then communicate the data block usage information or unused block information to each sub-controller 705 and/or storage device 706. The RAID storage controller 702 may also determine the portion of the data block usage information to send to each sub-controller 705/storage device 706. Similarly, in one embodiment, the block usage synchronizer 610 may synchronize the data block usage information on the RAID storage controller 702, which then updates the unused blocks for each storage device 706 or notifies the sub-controller 705 for each storage device 706 of the unused blocks. The unused block information for each storage device 706 in the RAID configuration 704 may be maintained by the sub-controllers 705 or the RAID storage controller 702 or by both in cooperation with each other.

In one embodiment, the block usage utility 606 communicates data block usage information/unused block information to the RAID storage controller 702 and also directs the RAID storage controller 702 regarding one or more portions of the data block usage information/unused block information to send to each sub-controller 705/storage device 706.

In another embodiment, the block usage utility 606 directly communicates the data block usage information/unused block information to the sub-controller 705 managing each storage device 706. The sub-controller 705 receives the data block usage information/unused block information for the blocks stored on the storage device 706 under its control. Likewise, the block usage synchronizer 610 may also directly synchronize data block usage information for each storage device 706 of the RAID by communicating directly with the sub-controller 705 for each storage device 706.

In certain embodiments, the block usage utility 606 directly communicates the data block usage information/unused block information to one sub-controller 705 a which then acts as a master sub-controller 705 and communicates the data block usage information/unused block information to the other sub-controllers 705 b-n. Similarly, the block usage synchronizer 610 may also synchronize data block usage information with a master sub-controller 705 a that directs the other sub-controllers 705 b-n accordingly.

The block usage utility 606/block usage synchronizer 610 may determine a RAID configuration 704 (also referred to as a device layout) of the RAID storage controller 702 and communicate data block usage information/unused block information or synchronize data block usage information based on the determined RAID configuration. In another embodiment, the RAID configuration is predetermined.

In one embodiment, the RAID configuration 704 comprises a RAID 0 configuration that stores data as a stripe across two or more storage devices 706. In this RAID configuration, as is known in the art, each storage device 706 stores a portion of the data for the stripe. Similarly, data block usage information pertaining to data that spans multiple storage devices 706 in the stripe is divided among the storage devices 706 of the stripe. In one embodiment, the block usage utility 606/block usage synchronizer 610 identifies portions of the data block usage information corresponding to data blocks stored on each storage device 706 and then sends a message to, or synchronizes block usage information with, the appropriate storage controller based on the blocks stored on each storage device 706.

In one embodiment, the RAID configuration 704 comprises a RAID 1 configuration that mirrors data stored on a first storage device 706 a to a second storage device 706 b or that mirrors data stored on a first plurality of storage devices 706 a-b to a second plurality of storage devices 706 c-d. The block usage utility 606/block usage synchronizer 610 may communicate similar unused block information to or make similar synchronization changes of block usage information for the first storage device 706 a and the second (mirror) storage device 706 b or for the first plurality of storage devices 706 a-b and the second (mirror) plurality of storage devices 706 c-d.

In one embodiment, the RAID configuration 704 comprises a RAID 5 configuration that stores data as a stripe across three or more storage devices 706 a-c. The stripe comprises two or more data strides and a distributed parity data stride, and each data stride is stored on a storage device 706. For example, a first storage device 706 a and a second storage device 706 b may each store a data stride and a third storage device 706 c may store a parity data stride. The sub-controller 705 for each storage device 706 storing a particular stride may maintain the data block usage information for that particular stride.

The parity calculation of the parity data stride is dependent on the data in each stride forming the stripe. In one embodiment, as blocks of a stripe change state from used to unused, the parity stride may be recalculated and rewritten. In another embodiment, the block usage utility 606/block usage synchronizer 610 determines that each data stride in the stripe has no used blocks. If all of the data blocks of the data strides in the stripe are unused, the block usage utility 606 may then communicate that the stripe is unused and thus overhead in managing the parity data stride is avoided.
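
The check described above can be stated compactly. The following sketch, offered only as an illustration, assumes each data stride is tracked as a set of used logical blocks and reports a stripe as unused only when every data stride is empty, so the parity stride never needs to be recalculated for a partially used stripe.

    # Illustrative whole-stripe check for a RAID 5 layout.
    def stripe_is_unused(data_strides):
        """Return True only when no data stride in the stripe holds a used block."""
        return all(len(used_blocks) == 0 for used_blocks in data_strides)

    print(stripe_is_unused([set(), {17, 18}]))  # False: one stride still in use
    print(stripe_is_unused([set(), set()]))     # True: whole stripe may be trimmed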

Similarly, the block usage synchronizer 610 may synchronize the data block usage information for the storage devices 706 storing data strides of the stripe without affecting the parity calculation of the parity data stride because the whole stripe is unused. In certain embodiments, after determining that the stripe has no used blocks, the block usage utility 606/block usage synchronizer 610 designates data block usage information corresponding to the stripe as unused. The data block usage information corresponding to the stripe may be maintained by the RAID controller 702 and/or the sub-controllers 705.

FIG. 8 is a schematic block diagram illustrating another embodiment of a system 800 for data block usage information synchronization for a non-volatile storage volume using a RAID controller in accordance with the present invention. Specifically, FIG. 8 depicts one embodiment of a RAID 10 (1+0) configuration. FIG. 8 includes components similar to those of FIG. 6 and FIG. 7, specifically a block usage utility 606, a block map 604, and a storage manager 608. FIG. 8 also includes a RAID storage controller 802 managing four solid-state storage devices 810 in a RAID 10 configuration. In the depicted embodiment, the RAID storage controller 802 includes a top-level RAID 0 controller 804 with sub-RAID 1 controllers 806, each controlling sub-controllers 808 in communication with the storage devices 810. In one embodiment, the sub-controllers 808 may be similar to the sub-controllers 705 described above in relation to FIG. 7. In addition, although four storage devices 810 are depicted, a RAID 10 configuration may include four or more storage devices 810.

The RAID 10 configuration may mirror a stride of data between two or more storage devices 810 a,b and storage devices 810 c,d using a RAID 1 configuration and store stripes of data across two or more storage device sets 812 using a RAID 0 configuration. For example, storage device 810 a may include a first data stride mirrored onto storage device 810 b and storage device 810 c may include a second data stride mirrored onto storage device 810 d. In one embodiment, the block usage utility 606/block usage synchronizer 610 identifies portions of the data block usage information corresponding to data blocks stored in each data stride, sends a message to the corresponding RAID 0 controller 804, RAID 1 controller 806, and/or sub-controller 808 or synchronizes data block usage information based on the blocks stored on each data stride, and then sends a similar message or performs similar synchronization operations for the mirrored data strides.

FIG. 9 is a schematic block diagram illustrating one embodiment of an apparatus for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention. The apparatus depicts one embodiment of the block usage synchronizer 610 in FIG. 6. The apparatus includes a reference module 902 and a synchronization module 904 which are described below. The description of the apparatus also refers to elements of FIG. 6, like numbers referring to like elements.

The reference module 902 facilitates access to data block usage information maintained by the storage manager 608. Specifically, the reference module 902 may reference, retrieve, copy, access, and/or create a pointer to data block usage information which may include unused or unallocated data block information maintained by the storage manager 608. The reference module 902 may reference this information for data blocks (associated with logical block addresses) of a non-volatile storage volume managed by a storage manager 608 or other non-volatile storage media including solid-state storage media 110.

In one embodiment, the reference module 902 references data block usage information for a set of logical block addresses for the non-volatile storage volume. In certain embodiments, the reference module 902 references data block usage information for a subset of logical blocks from a total number of logical blocks maintained by the storage manager 608. For example, the reference module 902 may reference a set of logical blocks, a group of logical blocks, a range of logical blocks, logical blocks associated with a volume, and the like.

The non-volatile storage volume may be a block-oriented volume comprising non-volatile storage media that stores a plurality of data blocks. In one embodiment, the non-volatile storage volume is a flash storage volume including one or more flash memory storage media. In one embodiment, the non-volatile storage volume is a storage device such as a hard disk drive or a solid-state storage drive. In one embodiment, the non-volatile storage volume is a live/online/mounted volume actively servicing storage requests.

As described above, in one embodiment, the storage manager 608 maintains the data block usage information for the storage client 607. In other embodiments, the storage manager 608 stores, records, provides and/or manages the data block usage information for logical blocks stored by one or more storage clients 607 and/or storage managers 608.

As stated above, the data block usage information may include the identity of used blocks or allocated blocks, unused blocks or free blocks, freed blocks, or unallocated blocks that the storage manager 608 has not allocated. In one embodiment, the reference module 902 references data block usage information comprising freed blocks unallocated by the storage manager 608 within a certain period of time or subsequent to a certain event. For example, the reference module 902 may reference data block usage information for freed blocks unallocated since the last time the reference module 902 referenced unallocated data block usage information.

In one embodiment, referencing data block usage information requires a plurality of steps. The reference module 902 may first reference data block usage information providing the identity of allocated blocks. The reference module 902 may then determine the identity of unused, or unallocated data blocks. The reference module 902 may then determine if the unused blocks are recently freed or have never been allocated.

In one embodiment, the reference module 902 references data block usage information by way of a storage Application Programming Interface (“API”) of the storage manager 608. Alternatively, the reference module 902 references data block usage information by way of a storage API of the storage client 607. In one embodiment, the storage API is a pre-existing API included with the storage manager 608. In certain embodiments, the storage API is not intended for deallocation commands or block synchronization. In one embodiment, the storage API is a defragmentation API for block-oriented storage devices 102.

In one embodiment, the reference module 902 operates in kernel mode and the reference module 902 references data block usage information, in kernel mode, through the API. In another embodiment, a portion of the reference module 902 operates in user mode, such as in the block usage utility 606, and references the API in user mode. In this embodiment, the portion of the reference module 902 in user mode provides, copies, or otherwise makes available a pointer to the data block usage information or a copy of the data block usage information to the portion of the reference module 902 in kernel mode.

As stated above, the reference module 902 may reference the data block usage information as a block map 604 including a bit map that uses bits to represent allocated blocks or unallocated blocks, although other data structures besides a block map may be used. The reference module 902 may request a block map 604 for a specific set of logical blocks.
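
As a non-limiting illustration of such a bit map, the Python sketch below builds one bit per logical block over a requested range, with a set bit standing for an allocated block. The storage manager query is a stand-in stub, and the bit polarity is an assumption of the example.

    # Illustrative construction of a bit map for a requested range of blocks.
    def storage_manager_allocated(lba: int) -> bool:
        """Stand-in for the storage manager's allocation state (assumed pattern)."""
        return lba % 3 == 0

    def build_block_map(first_lba: int, count: int) -> bytearray:
        """One bit per logical block in [first_lba, first_lba + count)."""
        bitmap = bytearray((count + 7) // 8)
        for offset in range(count):
            if storage_manager_allocated(first_lba + offset):
                bitmap[offset // 8] |= 1 << (offset % 8)
        return bitmap

    print(build_block_map(0, 16).hex())  # '4992': bits 0, 3, 6, 9, 12, 15 set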

The synchronization module 904 synchronizes data block usage information managed by the storage controller 616 with the data block usage information maintained by the storage manager 608. Data block usage information managed by the storage controller 616 may include information in the logical-to-physical translation layer 512 regarding logical block address to physical block address mapping. As stated above, in one embodiment, the storage manager 608 and the storage controller 616 communicate through a block-device interface, and the storage controller 616 uses a logical-to-physical translation layer 512 that maps logical block addresses to physical block addresses of data stored on solid-state storage media 110. As a result, in this embodiment, the storage manager 608 maintains its data block usage information separate from the data block usage information managed by the storage controller 616, and the two sets of data block usage information can become unsynchronized, particularly when the storage manager 608 and/or the block-device interface does not support deallocation message passing.

The synchronization module 904, in one embodiment, uses the data block usage information, which represents unallocated or unused logical data blocks, and deallocates the corresponding physical blocks on the non-volatile solid-state storage media 110 managed by the storage controller 616. The synchronization module 904 may deallocate the corresponding physical blocks directly or may cause the corresponding physical blocks to be deallocated. In another embodiment, the synchronization module 904 issues a command or sends a message for the storage controller 616 to deallocate the physical blocks. In a further embodiment, the storage controller 616 returns a confirmation when the physical blocks have been successfully deallocated. Those of skill in the art recognize various ways that the synchronization module 904 can deallocate the physical blocks in relation to the logical block identifiers or addresses, including updating of flags or other metadata relating to the data block usage status.

In one embodiment, the synchronization module 904 synchronizes the logical-to-physical translation layer 512 maintained by the storage controller 616. Specifically, in one embodiment, the synchronization module 904 deallocates unused blocks by removing entries for the unused blocks in a logical-to-physical map or index or by removing nodes for the unused blocks in a logical-to-physical tree data structure. In another embodiment, the synchronization module 904 causes the storage controller 616 to deallocate the unused blocks by removing, marking, or updating entries for the unused blocks in the logical-to-physical map or index.
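
The following Python sketch illustrates, under the simplifying assumption of a dictionary-based logical-to-physical map, how removing entries for unused blocks deallocates them; an actual embodiment might instead use a tree, or mark entries rather than delete them.

    # Illustrative deallocation by removing logical-to-physical entries.
    def deallocate_unused(l2p_map: dict, unused_lbas) -> None:
        """Drop the mapping entries for blocks reported as unused; entries that
        are already absent are simply ignored."""
        for lba in unused_lbas:
            l2p_map.pop(lba, None)

    l2p = {10: 0x1000, 11: 0x1040, 12: 0x1080}
    deallocate_unused(l2p, [11, 12, 99])
    print(l2p)  # {10: 4096}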

Referring also to FIG. 7, in one embodiment, the storage controller 616 includes a RAID storage controller 702 storing data in a RAID configuration 704. The synchronization module 904 may synchronize the data block usage information managed for the storage devices 706 in the RAID array with the data block usage information from the storage manager 608. In one embodiment, the synchronization module 904 determines a RAID configuration 704 of either the RAID storage controller 702 or of the RAID storage controller 702 and the sub-controllers 705. The synchronization module 904 may then synchronize the data block usage information based on the determined RAID configuration 704. The RAID configuration 704 may include information on the types of volumes in the RAID array, the RAID level (such as RAID 0 or RAID 1), the number of storage devices 706, and the like.

As described above, the synchronization module 904 may synchronize data block usage information by communicating, signaling, or sending a message to the RAID storage controller 702. For example, the synchronization module 904 may communicate with the RAID storage controller 702 to synchronize one or more storage devices 706 in the RAID array. The RAID storage controller 702 may identify and/or deallocate unused blocks in each appropriate storage device 706 in the RAID array. In another embodiment, the synchronization module 904 may also communicate, signal, or send a message to the RAID storage controller 702 and indicate the appropriate storage device 706 and portion of the data block usage information for each storage device 706 in the RAID array.

In another embodiment, the synchronization module 904 communicates, signals, or sends a message directly to a storage controller 616 managing a storage device 706 in the RAID array to synchronize the data block usage information.

In one embodiment, the RAID configuration 704 comprises a RAID 0 configuration that stores data as a stripe across two or more storage devices 706. The synchronization module 904 may synchronize the data blocks of a storage device 706 in the RAID array with data block usage information for the data blocks of the storage device 706. Specifically, the synchronization module 904 may identify a first portion of the data block usage information from the storage manager 608 that corresponds to data blocks stored on a first storage device 706 a of the RAID array. The synchronization module 904 may identify a second portion of the data block usage information corresponding to data blocks stored on a second storage device 706 b.

The synchronization module 904 may synchronize data block usage information managed for the first storage device 706 a with the first portion of the data block usage information from the storage manager 608. The synchronization module 904 may also synchronize data block usage information managed for the second storage device 706 b with the second portion of the data block usage information from the storage manager 608. As a result, the synchronization module 904 synchronizes data block usage information for each storage device 706 with the portion of the data block usage information particular to the blocks stored by each storage device 706.

In one embodiment, the RAID configuration 704 comprises a RAID 1 configuration. In this embodiment, the synchronization module 904 also synchronizes one or more mirror storage devices 706. Specifically, the synchronization module 904 may synchronize data block usage information managed for a first storage device 706 a with the data block usage information from the storage manager 608. The synchronization module 904 may also synchronize data block usage information managed for a second storage device 706 b (storing mirrored data of the first storage device 706 a) with the data block usage information of the storage manager 608. However, in one embodiment, a storage device 706 a and a mirror storage device 706 b may share common data block usage information (such as a common logical-to-physical mapping tree). Consequently, when the synchronization module 904 synchronizes such storage devices 706, the mirror storage device 706 b is automatically synchronized when the synchronization module 904 synchronizes the mirrored storage device 706 a. Therefore, in this embodiment, the synchronization module 904 would not actively synchronize the mirror storage device 706 b.

In one embodiment, the RAID configuration 704 comprises a RAID 5 configuration that stores data as a stripe across three or more storage devices 706 and includes a distributed parity stride along with two or more data strides. The synchronization module 904 may ensure that parity calculations for the parity data stride remain accurate. Specifically, the synchronization module 904, in one embodiment, determines, based on the data block usage information from the storage manager 608, that each data stride in the stripe has no used blocks. The synchronization module 904 may synchronize data block usage information managed for these data strides of the stripe with a corresponding portion of the data block usage information of the storage manager 608 by designating or identifying the blocks in the stripe as unused. Therefore, if the entire stripe (not including the parity stride) is made up of unused blocks, the synchronization module 904 may identify the entire stripe as unused without destroying the parity calculation for the parity data stride.

Referring now to FIGS. 8 and 9, in one embodiment, the RAID configuration comprises a RAID 10 configuration that mirrors a stride of data between two or more storage devices 810 a,b using a RAID 1 configuration and that stores stripes of data across two or more storage device sets 812 using a RAID 0 configuration. The synchronization module 904 may synchronize data block usage information for each storage device 810 a,c and also synchronize data block usage information for each of the mirror storage devices 810 b,d.

Specifically, the synchronization module 904, in one embodiment, identifies a first portion of the data block usage information from the storage manager 608 corresponding to data blocks stored in a first stride managed by the RAID storage controller 802. For example, a sub-controller 808 a may maintain data block usage information for the first stride on a first storage device 810 a. The synchronization module 904 may identify a second portion of the data block usage information from the storage manager 608 corresponding to data blocks stored on a second stride managed by the RAID storage controller 802. For example, a sub-controller 808 c may maintain data block usage information for the second stride on a second storage device 810 c. The synchronization module 904 may synchronize data block usage information managed for the first stride with the first portion of the data block usage information from the storage manager 608. The synchronization module 904 may synchronize data block usage information managed for the second stride with the second portion of the data block usage information from the storage manager 608.

In one embodiment, the synchronization module 904 synchronizes the data block usage information for the storage devices 810 b,d mirroring the first and second storage devices 810 a,c. In another embodiment, as stated above, a storage device 810 a,c and a mirror storage device 810 b,d may share common data block usage information. Consequently, when the synchronization module 904 synchronizes a storage device 810 a,c, the mirrored storage device 810 b,d is synchronized also.

FIG. 10 is a detailed schematic block diagram illustrating another embodiment of an apparatus for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention. The apparatus includes the reference module 902 and the synchronization module 904, wherein these modules include substantially the same features as those described above in relation to FIG. 9. Additionally, the synchronization module 904 includes a block determination module 1002 and a deallocation module 1004, and the apparatus includes an update module 1006 and a block usage utility 606 with a user mode reference module 1008 and an initiation module 1010. The description of the apparatus also refers to elements of FIGS. 6 and 9, like numbers referring to like elements.

The block determination module 1002 determines one or more unused blocks from the data block usage information. In one embodiment, the block determination module 1002 determines unused blocks by referencing bits in the bit map 604. The block map 604 may be a bit map with each bit representing an allocable block and the binary value for the bit representing whether the allocable block is an allocated block or an unallocated block. If the block map 604 shows allocated blocks, the block determination module 1002 may determine unallocated blocks from the allocated block information.
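
For instance, if the block map 604 records allocated blocks, the unallocated blocks are simply the clear bits, as in the brief sketch below; the bit polarity and helper name are assumptions of the illustration.

    # Illustrative derivation of unallocated blocks from an allocated-block bit map.
    def unallocated_from_allocated(bitmap: bytes, num_blocks: int):
        """Return the logical block addresses whose allocation bit is clear."""
        return [lba for lba in range(num_blocks)
                if not (bitmap[lba // 8] >> (lba % 8)) & 1]

    print(unallocated_from_allocated(bytes([0b00001001]), 8))  # [1, 2, 4, 5, 6, 7]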

The deallocation module 1004, directly or indirectly, deallocates physical blocks in the storage controller 616 that correspond to unused logical blocks identified from the data block usage information from the storage manager 608. By deallocating the corresponding physical blocks, the deallocation module 1004 synchronizes data block usage information managed by the storage controller 616 with the data block usage information maintained by the storage manager 608.

In one embodiment, the deallocation module 1004 sends a message directly to the storage controller 616 directly managing the non-volatile storage volume. The message indicates to the storage controller 616 unused blocks identified by the storage manager 608. The storage controller 616 deallocates the unused blocks identified by the storage manager 608 in response to the message.

For example, the deallocation module 1004 may send a message indicating unused logical blocks to the storage controller 616. The storage controller 616 may then deallocate the physical blocks that are mapped to the logical blocks within the mapping used by the logical-to-physical translation layer 512 in response to the message. Those of skill in the art recognize a variety of different techniques for deallocating the logical blocks in a logical-to-physical index (See FIG. 2). In one embodiment, the storage controller 616 deallocates a logical block by removing an entry for the logical block from the logical-to-physical index, map, or similar data structure. In certain embodiments, the storage controller 616 sends a reply or confirmation to the deallocation module 1004 indicating that the blocks have been successfully deallocated.
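
A minimal sketch of this exchange appears below, assuming a dictionary-based logical-to-physical map and a boolean confirmation; the class and method names are hypothetical and are not drawn from any particular embodiment.

    # Illustrative message handling: deallocate listed blocks, then confirm.
    class SketchController:
        def __init__(self):
            self.l2p = {5: 0x2000, 6: 0x2040, 7: 0x2080}

        def handle_deallocate_message(self, unused_lbas) -> bool:
            """Remove mappings for the listed logical blocks and confirm."""
            for lba in unused_lbas:
                self.l2p.pop(lba, None)   # already-unmapped blocks are skipped
            return True                   # confirmation returned to the sender

    controller = SketchController()
    print(controller.handle_deallocate_message([6, 7]), controller.l2p)
    # True {5: 8192}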

In one embodiment, the deallocation module 1004 deallocates unused blocks in data block usage information maintained by the storage controller 616 corresponding to the unused blocks in the data block usage information from the storage manager 608. If the storage controller 616 already shows the block as unallocated or unused, marking the block as unallocated or unused again causes no ill effects. In certain embodiments, simply updating the data block usage information maintained by the storage controller 616 may be more efficient than checking first to determine if the data block usage information differs.

In another embodiment, the deallocation module 1004 first determines whether the storage controller 616 indicates particular blocks as used blocks in contrast to the storage manager 608 showing the blocks as unused. Specifically, in certain embodiments, the deallocation module 1004 deallocates blocks that the storage controller 616 had maintained as used blocks. In these embodiments, the deallocation module 1004 determines that the storage controller 616 identifies unused blocks indicated by the data block usage information as used blocks and deallocates the used blocks identified by the storage controller 616 corresponding to the one or more unused blocks.
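
Treated as sets of logical block addresses, the blocks to deallocate in this variant are simply the intersection of the storage manager's unused blocks with the storage controller's used blocks, as the short, illustrative sketch below assumes.

    # Illustrative "check first" variant: deallocate only blocks the controller
    # still tracks as used while the storage manager reports them as unused.
    def blocks_to_deallocate(manager_unused: set, controller_used: set) -> set:
        return manager_unused & controller_used

    print(blocks_to_deallocate({1, 2, 3, 8}, {2, 3, 4, 5}))  # {2, 3}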

As stated above, in certain embodiments, the logical-to-physical translation layer 512 depicted in FIG. 5 is a tree with nodes that represent logical block addresses and comprise corresponding physical block addresses. In one embodiment, the deallocation module 1004 deallocates unused blocks by removing entries for the unused blocks in the logical-to-physical map. In another embodiment, the deallocation module 1004 causes the storage controller 616 to deallocate the unused blocks by removing entries for the unused blocks in the logical-to-physical map.

The update module 1006 updates data block usage information to account for operations of a live storage volume/partition actively serving storage requests. In one embodiment, the update module 1006 monitors in-flight storage operations that modify the data block usage information. As described above, these in-flight storage operations may be executed by the storage controller 616 subsequent to referencing the data block usage information. These in-flight storage operations may be executed by the storage controller 616 prior to synchronizing the data block usage information.

In one embodiment, these in-flight storage operations include the storage operations executed by the storage controller 616 subsequent to referencing the block map 604 and executed by the storage controller 616 prior to deallocating the unused blocks as indicated by the storage manager 608. These in-flight storage operations may not be included in the data block usage information, having been launched or queued for execution before the data block usage information was accessed. However, these in-flight storage operations may still modify blocks that change the data block usage information because they are executed before the data block usage information is synchronized (the unused blocks as indicated by the storage manager 608 are deallocated or marked as unused). Therefore, the update module 1006 accounts for these in-flight storage operations.

In one embodiment, the update module 1006 monitors storage operations on data blocks represented in the block map 604. Specifically, in one embodiment, the update module 1006 monitors the in-flight storage operations for the particular set of data blocks for the block map 604 referenced by the reference module 902.

The update module 1006 records data block usage information for the storage operations that change unused blocks of the block map 604 to used blocks. In one embodiment, the update module 1006 records the data block usage information of these storage operations in an in-flight block map 612 described above.
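
As an illustration only, the sketch below records each monitored in-flight write by setting the matching bit in an in-flight block map laid out as a byte array; the layout mirrors the earlier bit map examples and is an assumption of the sketch.

    # Illustrative recording of in-flight writes in an in-flight block map.
    def record_in_flight_write(in_flight_map: bytearray, lba: int) -> None:
        """Mark a logical block touched by an in-flight operation as used."""
        in_flight_map[lba // 8] |= 1 << (lba % 8)

    in_flight = bytearray(2)           # covers sixteen logical blocks
    for written_lba in (3, 9):         # writes observed after the block map was read
        record_in_flight_write(in_flight, written_lba)
    print(in_flight.hex())             # '0802': bits 3 and 9 are set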

The user mode reference module 1008 facilitates access to the data block usage information when the storage API is accessible in user mode. The user mode reference module 1008 resides in the block usage utility 606 and references the storage API in user mode. For example, in certain embodiments, the user mode reference module 1008 calls a storage API function. In one embodiment, the user mode reference module 1008 provides, copies, or otherwise makes available the data block usage information or a pointer to the data block usage information to the reference module 902 in kernel mode. In another embodiment, the user mode reference module 1008 provides, copies, or otherwise makes available the data block usage information or a pointer to the data block usage information to the synchronization module 904.

The initiation module 1010 initiates the processes of the block usage synchronizer 610. Referring to FIGS. 6 and 10, in certain embodiments, the initiation module 1010 initiates the block usage synchronizer 610 in response to one or more predetermined events. For example, the initiation module 1010 may initiate the block usage synchronizer 610 in response to a performance threshold, an amount of storage space dropping below a threshold level, a certain number of file deletions, following a startup sequence, a dual boot transition phase, and the like.

In certain embodiments, the initiation module 1010 may initiate the block usage synchronizer 610 at a predetermined time interval. For example, the initiation module 1010 may initiate the block usage synchronizer 610 at a predetermined time every day or every hour, after a certain amount of “up” time by the computer system, and the like. The initiation module 1010 may also determine a set of logical blocks for the block usage synchronizer 610 and send an indication of these logical blocks to the reference module 902. The initiation module 1010 may select sets of logical block addresses for analysis during a scan of one or more volumes.

FIG. 11 is a schematic block diagram illustrating an embodiment of an apparatus 1100 for data management on non-volatile storage media maintained by a storage manager 608 in accordance with the present invention. The apparatus 1100 depicts one embodiment of the block usage utility 606 in FIG. 6. The apparatus 1100 includes a reference module 1102 and a message module 1104. The description of the apparatus 1100 also refers to elements of FIG. 6, like numbers referring to like elements.

The reference module 1102 facilitates access to data block usage information managed by the storage manager 608. Specifically, the reference module 1102 may reference, retrieve, copy, access, and/or create a pointer to data block usage information of the storage manager 608. The reference module 1102 may be similar to the reference module 902 depicted in FIG. 9. In one embodiment, the reference module 1102 operates in user mode and references data block usage information in user mode. In another embodiment, the reference module 1102 operates in kernel mode and references data block usage information from kernel mode.

In one embodiment, the reference module 1102 references data block usage information for a set of logical block addresses for the non-volatile storage volume. In certain embodiments, the reference module references data block usage information for a subset of logical blocks from a total number of logical blocks for a volume maintained by the storage manager 608. For example, the reference module may reference data block usage information for a set of logical blocks, a group of logical blocks, a range of logical blocks, and the like.

In one embodiment, the reference module 1102 references the data block usage information by way of a storage API of the storage manager 608. In one embodiment, the reference module 1102 references a block map, such as block map 604, defining data block usage information for the logical data blocks selected by the reference module 1102. The reference module 1102 may request a block map 604 for a specific set of logical blocks. One or more events may trigger or activate the reference module 1102. In addition, or alternatively, the reference module 1102 may operate according to a predetermined schedule.

The data block usage information may include the identity of free blocks, freed blocks, or blocks that the storage manager 608 has not allocated. In one embodiment, the reference module 1102 references freed blocks deallocated by the storage manager 608 within a certain period of time or subsequent to a certain event.

In one embodiment, referencing data block usage information includes a plurality of steps. The reference module 1102 may first reference data block usage information showing the identity of allocated logical blocks. The reference module 1102 may next determine the identity of unused, or unallocated data blocks. The reference module 1102 may then determine if the unused blocks are recently freed blocks or logical blocks that have never been allocated.

In one embodiment, the reference module 1102 determines one or more unused blocks from the block map 604. The unused blocks may be logical blocks. In one embodiment, the reference module 1102 determines unused blocks by reading bits in the bit map 604. Each bit may, depending on the embodiment, represent a used block (one that corresponds to valid data), or an unused block.

In one embodiment, the reference module 1102 does not determine unused blocks from the block map 604. In this embodiment, the reference module 1102 receives a list of unused blocks from the storage manager 608, which the reference module 1102 passes to the message module 1104 directly without the need to determine unused blocks.

The message module 1104 communicates the data block usage information to the storage controller 616. In one embodiment, the message module 1104 sends a message directly to the storage controller 616 managing the non-volatile storage media. The message may include unused block information identifying to the storage controller 616 the unused logical blocks that the storage manager 608 identifies. The message module 1104 may receive a list of unused blocks directly from the reference module 1102. In one embodiment, the message module 1104 sends a message for each logical block identified that is no longer in use as defined by the storage manager 608. In another embodiment, the message module 1104 sends a message for a set of logical blocks.

In certain embodiments, the message complies with an interface operable to communicate storage information between the storage manager 608 and the storage controller 616. In one embodiment, the message is a Trim message or command. In one embodiment, the message comprises a notification passing the block usage information to the storage controller 616. In one embodiment, the message comprises a notification passing unused block information to the storage controller 616. The unused block information may include the unused blocks identified by the storage manager. In certain embodiments, the notification includes no requirement for action by the storage controller 616 in accordance with the interface. As a result, the storage controller 616 may or may not deallocate the physical blocks identified from the unused block information. In accordance with the interface, the storage controller 616 determines if deallocating the physical blocks is advantageous.

In one embodiment, according to an interface, the message includes a directive passing block usage information and/or unused block information to the storage controller 616. The block usage information and/or unused block information may include the unused blocks identified by the storage manager. In this embodiment, the directive requires the storage controller 616 to erase the non-volatile storage media comprising the unused blocks in accordance with the interface. As a result, the message module 1104 may ensure that the storage controller 616 erases non-volatile storage media corresponding to the unused blocks. In one embodiment, the storage controller 616 passes a response, message, or confirmation that indicates the storage controller 616 has complied with the directive and erased the non-volatile storage media.

In one embodiment, the storage controller 616 may delay or defer performing the erase operation of the non-volatile storage media comprising the unused blocks until later in time or until the storage media for the unused blocks is needed. Instead, the storage controller 616 may update the logical-to-physical map to mark the appropriate logical blocks as unused blocks. In certain embodiments, marking the logical blocks as unused is sufficient to erase the logical blocks without erasing the media because the storage controller 616 is configured to respond to read requests for those logical blocks with an indication that no data exists, for example by returning all zeros or null values instead of the data stored on the non-volatile storage media. In certain embodiments, the marking of the logical blocks as unused may be lost if the marking is not recorded in non-volatile memory prior to a power loss. Consequently, when the storage controller 616 reconstructs an index used by the logical-to-physical translation layer 512 by scanning the solid-state storage media 110 in the order that the data was written, the storage controller 616 may still identify the logical blocks as used. However, the storage manager 608 indicates the logical blocks as unused, so no read requests will be made for these logical blocks.
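
The deferred behavior can be pictured with the following sketch, which assumes a dictionary-based logical-to-physical map and a fixed block size: reads of a deallocated (unmapped) logical block return zeros rather than the stale media contents.

    # Illustrative read path for deferred erasure: unmapped blocks read as zeros.
    BLOCK_SIZE = 512

    def read_block(l2p_map: dict, media: dict, lba: int) -> bytes:
        """Return stored data for mapped blocks and zeros for deallocated ones."""
        physical = l2p_map.get(lba)
        if physical is None:
            return bytes(BLOCK_SIZE)   # block was deallocated; no data exists
        return media[physical]

    media = {0x3000: b"\xAA" * BLOCK_SIZE}
    l2p = {42: 0x3000}
    del l2p[42]                         # deallocation removes only the mapping
    print(read_block(l2p, media, 42) == bytes(BLOCK_SIZE))  # True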

In one embodiment, according to an interface, the message includes a purge instruction passing the block usage information and/or unused block information to the storage controller 616. The block usage information and/or unused block information may include the unused blocks identified by the storage manager. In this embodiment, the purge instruction requires the storage controller 616 to perform an erase operation on the non-volatile storage media comprising the unused blocks and to overwrite the unused blocks one or more times using a predefined pattern in accordance with the interface. In one embodiment, the storage controller 616 uses one or more iterations of writing one or more different data patterns in order to completely alter the binary values in the unused blocks to ensure that the original data is unrecoverable.
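
A simplified sketch of such a multi-pass overwrite follows; the specific patterns, block size, and in-memory media model are assumptions made only for the illustration.

    # Illustrative purge-style overwrite of physical blocks backing unused blocks.
    BLOCK_SIZE = 512
    PATTERNS = (b"\x00", b"\xFF", b"\xA5")    # example multi-pass patterns

    def purge_blocks(media: dict, physical_addresses) -> None:
        """Overwrite each listed physical block once per pattern, in order."""
        for pattern in PATTERNS:
            fill = pattern * BLOCK_SIZE
            for address in physical_addresses:
                media[address] = fill

    media = {0x4000: b"secret".ljust(BLOCK_SIZE, b"\x00")}
    purge_blocks(media, [0x4000])
    print(media[0x4000][:4])  # b'\xa5\xa5\xa5\xa5' after the final pass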

As a result, the message module 1104 ensures that the storage controller 616 overwrites data corresponding to the unused blocks. Advantageously, the storage controller 616 may overwrite sensitive data to prevent the chance of unauthorized access. In one embodiment, the purge instruction requires the storage controller 616 to identify and overwrite previous versions of data stored in earlier locations in a log-based storage format, as described above, to ensure a complete overwrite and secure erasure of the data. In one embodiment, the storage controller 616 passes a response, message, or confirmation that indicates the storage controller 616 has complied with the purge instruction and overwritten the non-volatile storage media.

Referring also to FIG. 7, in one embodiment, the storage controller 616 includes a RAID storage controller 702 storing data in a RAID configuration 704. The message module 1104 may send one or more messages communicating the unused blocks identified by the storage manager 608 to a RAID storage controller 702 or to one or more sub-controllers 705. In one embodiment, the message module 1104 determines a RAID configuration 704 of either the RAID storage controller 702 or the RAID storage controller 702 and sub-controllers 705. The message module 1104 may then send messages to communicate the unused blocks based on the determined RAID configuration 704.

As described above, the message module 1104 may send a message to the RAID storage controller 702 if the RAID storage controller 702 is configured to update unused block information for the appropriate storage device 706 in the RAID array. In another embodiment, the message module 1104 may send a message directly to a sub-controller 705 managing a storage device 706 in the RAID array.

In one embodiment, the RAID configuration 704 comprises a RAID 0 configuration that stores data as a stripe across two or more storage devices 706. The message module 1104 may send a message to a sub-controller 705 of a storage device 706 in the RAID array with unused block information specific to that storage device 706. Specifically, the message module 1104 may identify a first portion of the block map 604 that corresponds to data blocks stored on a first storage device 706 a managed by a RAID controller 702 or sub-controller 705. The message module 1104 may identify a second portion of the block map 604 corresponding to data blocks stored on a second storage device 706 b.

The message module 1104 may send a first message to the RAID controller 702 or sub-controller 705 a identifying one or more unused blocks on the first storage device 706 a identified by the first portion of the block map 604. The message module 1104 may send a second message to the RAID controller 702 or sub-controller 705 b identifying one or more unused blocks on the second storage device 706 b identified by the second portion of the block map 604. As a result, the message module 1104 may customize the messages sent to the RAID controller 702 or sub-controller 705 for each storage device 706.

In one embodiment, the RAID configuration 704 comprises a RAID 1 configuration with one or more mirrored storage devices 706. In this embodiment, the message module 1104 may also send a message to the RAID controller 702 or sub-controller 705 managing one or more mirror devices 706. Specifically, the message module 1104 may send a first message to the RAID controller 702 or sub-controller 705 managing a first storage device 706 a, the message identifying one or more unused blocks on the first storage device 706 a identified by the block map 604. The message module 1104 may send a second message to the RAID controller 702 or sub-controller 705 managing the second storage device 706 b (storing mirrored data of the first storage device 706 a) identifying one or more unused blocks on the second storage device 706 b identified by the block map 604.

However, as described above, in one embodiment, a storage device 706 a and a mirrored storage device 706 b may share common data block usage information. Consequently, the message for one storage device 706 a may have equal applicability to the mirror storage device 706 b without the need for additional messages.

In one embodiment, the RAID configuration 704 comprises a RAID 5 configuration that stores data as a stripe across three or more storage devices 706 and includes a distributed parity stride along with two or more data strides. To maintain parity integrity, the message module 1104, in one embodiment, determines, based on the block map 604, that each data stride in the stripe has no used blocks. The message module 1104 may send a message to the RAID storage controller 702 designating data blocks corresponding to the stripe as unused.

Referring now to FIGS. 8 and 11, in one embodiment, the RAID configuration comprises a RAID 10 configuration that mirrors a stride of data between two or more storage devices 810 a,b using a RAID 1 configuration and that stores stripes of data across two or more storage device sets 812 using a RAID 0 configuration. The message module 1104 may send messages particular to the stride on each storage device 810 a,c and also send messages communicating unused blocks for the mirror storage devices 810 b,d.

Specifically, the message module 1104, in one embodiment, identifies a first portion of the block map 604 corresponding to data blocks stored in a first stride managed by the RAID storage controller 802. For example, a sub-controller 808 a may maintain data block usage information for data blocks of the first stride on a first storage device 810 a. The message module 1104 may identify a second portion of the block map 604 corresponding to data blocks stored in a second stride managed by the RAID storage controller 802. For example, a sub-controller 808 c may maintain data block usage information for data blocks of the second stride on a second storage device 810 c. The message module 1104 may send a first message to the sub-controller 808 a managing the first stride identifying one or more unused blocks in the first stride identified by the first portion of the block map 604. The message module 1104 may also send a second message to the sub-controller 808 c managing the second stride identifying one or more unused blocks in the second stride identified by the second portion of the block map 604.

In one embodiment, the message module 1104 also sends messages for the storage devices 810 b,d mirroring the first and second storage devices 810 a,c. In another embodiment, as stated above, a storage device 810 a,c and a mirrored storage device 810 b,d may share common data block usage information.

FIG. 12 is a detailed schematic block diagram illustrating another embodiment of an apparatus 1200 for data management on non-volatile storage media managed by a storage manager 608 in accordance with the present invention. The apparatus 1200 includes the reference module 1102 and the message module 1104, wherein these modules include substantially the same features as described in relation to FIG. 11. Additionally, the apparatus 1200 includes a determination module 1202 that includes a monitor module 1204, a record module 1206, and a map combination module 1208. The apparatus 1200 includes a deallocation module 1210 that includes a lock module 1212. The description of the apparatus 1200 also refers to elements of FIGS. 6 and 11, like numbers referring to like elements.

The determination module 1202 determines one or more unused blocks from the block map 604. The unused blocks may be logical blocks. In one embodiment, the determination module 1202 determines unused blocks by reading bits in the bit map 604. Each bit may, depending on the embodiment, represent a used block (one that corresponds to valid data), or an unused block.

The monitor module 1204 monitors storage operations on data blocks represented by the block map 604 to account for operation of a live volume actively servicing storage requests. Specifically, in one embodiment, the monitor module 1204 monitors the in-flight storage operations for the particular set of data blocks for the block map 604 referenced by the reference module 1102. These in-flight storage operations include the storage operations executed by the storage controller 616 subsequent to referencing the block map 604 and executed by the storage controller 616 prior to deallocating the unused blocks.

The record module 1206 records data block usage information for the in-flight storage operations that change unused blocks of the block map 604 to used blocks. In one embodiment, the record module 1206 records the data block usage information of these storage operations in an in-flight block map 612 as described above. The record module 1206 may record the logical block addresses of logical blocks affected by the in-flight storage operations monitored by the monitor module 1204. In one embodiment, the in-flight block map 612 is a bit map having the same size and structure as the block map 604. Accordingly, the record module 1206 may record used blocks by setting a corresponding bit in the in-flight block map 612.

The map combination module 1208 updates the block map 604 (See FIG. 6) to reflect changes from in-flight storage operations. In one embodiment, the map combination module 1208 combines the block map 604 and the in-flight block map 612 to identify the unused blocks of the data blocks. In one embodiment, the map combination module 1208 combines the block map 604 and the in-flight block map 612 into a combined block map 614 that identifies the unused blocks of the data blocks being monitored. In one embodiment, the block map 604 is OR'ed with the in-flight block map 612 to combine the maps and determine updated data block usage information.

The deallocation module 1210 deallocates unused physical blocks to synchronize the data block usage information managed by the storage controller 616 with the data block usage information maintained by the storage manager 608. In certain embodiments, the deallocation module 1210 deallocates blocks that the storage controller 616 maintains as used blocks, or blocks that hold data that the storage controller 616 is preserving. Specifically, in one embodiment, the deallocation module 1210 deallocates used blocks identified by the storage controller 616 corresponding to unused blocks identified by the storage manager 608 based on data block usage information. In another embodiment, the deallocation module 1210 determines that the storage controller 616 identifies unused blocks indicated by the data block usage information as used blocks and deallocates the used blocks identified by the storage controller 616 corresponding to the one or more unused blocks.

In one embodiment, the deallocation module 1210 deallocates blocks by removing entries for the unused blocks in the logical-to-physical map. In another embodiment, the deallocation module 1210 signals the storage controller 616 to perform the deallocation. In one embodiment, the deallocation module 1210 updates unused block information and/or data block usage information recorded on the non-volatile storage media in place of, or in addition to, updates to the unused block information and/or data block usage information in the logical-to-physical map. In this embodiment, the deallocation module 1210 may indicate, in log-based storage, that the unused blocks are deallocated and available for storage space recovery. The deallocation module 1210 may update a storage space recovery data structure stored in volatile memory or in non-volatile memory. The storage space recovery data structure may track, for the storage controller 506, which physical parts of the storage media are available for storage space recovery. For example, the storage space recovery data structure may record which logical erase blocks (“LEB”) or parts of LEBs are available for data recovery. In one embodiment, the deallocation module 1210 updates the storage space recovery data structure in response to, or in conjunction with, deallocating blocks by removing entries for the unused blocks in the logical-to-physical map.
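
A hedged sketch of such a deallocation step, assuming the logical-to-physical map is a dictionary whose values are (LEB, offset) tuples and the storage space recovery structure tracks recoverable bytes per LEB (all names and structures here are illustrative assumptions, not the specification's data layout):

def deallocate(unused_lbas, l2p_map, leb_recoverable, block_size):
    """Drop logical-to-physical entries for unused blocks and credit their
    space to the owning logical erase block (LEB) for later recovery."""
    for lba in unused_lbas:
        mapping = l2p_map.pop(lba, None)   # mapping assumed to be a (leb, offset) tuple
        if mapping is None:
            continue                        # block was never mapped; nothing to deallocate
        leb, _offset = mapping
        leb_recoverable[leb] = leb_recoverable.get(leb, 0) + block_size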

The lock module 1212 maintains data integrity during changes to the logical-to-physical translation layer 512 of the storage controller 616. In one embodiment, the lock module 1212 obtains a lock on the logical-to-physical map data structure managed by the storage controller 616 prior to updating the block map 604 to include in-flight storage operations. The lock module 1212 releases the lock on the logical-to-physical map subsequent to the storage controller 616 deallocating the unused blocks. The lock module 1212 ensures that changes to the logical-to-physical map are synchronized so as to not cause errors or data failures from other processes accessing the logical-to-physical map. In one embodiment, the lock module 1212 obtains the lock before the map combination module 1208 combines the block map 604 and the in-flight block map 612 so that no other in-flight operations modify the logical-to-physical map.
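
The locking discipline described above might be sketched as follows, assuming a threading lock guards the logical-to-physical map and reusing the illustrative helpers sketched earlier (the lock object and function names are assumptions for illustration only):

import threading

l2p_lock = threading.Lock()   # guards the logical-to-physical map during synchronization

def combine_and_deallocate(block_map, in_flight_map, l2p_map, leb_recoverable,
                           block_size, total_blocks):
    """Combine the maps and deallocate while holding the lock, so other
    in-flight operations cannot modify the logical-to-physical map mid-update."""
    with l2p_lock:
        combined = combine_maps(block_map, in_flight_map)
        deallocate(unused_blocks(combined, total_blocks),
                   l2p_map, leb_recoverable, block_size)
    # lock released here, only after deallocation of the unused blocks completes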

FIG. 13A is a schematic flow chart diagram illustrating one embodiment of a method 1300 for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention. The description of the method 1300 refers to elements of FIGS. 6 and 9, like numbers referring to like elements.

The method 1300 begins and the reference module 902 references 1302 data block usage information for data blocks of a non-volatile storage volume managed by a storage manager 608. The storage manager 608 maintains the data block usage information, which the reference module 902 may reference through a storage API of the storage manager 608. In certain alternative embodiments, the reference module 902 determines unused or unallocated data block information from the data block usage information and provides the unused or unallocated data block information to the synchronization module 904.

Next, the synchronization module 904 synchronizes 1304 data block usage information managed by a storage controller 616 with the data block usage information maintained by the storage manager 608. The storage manager 608 maintains the data block usage information separate from data block usage information managed by the storage controller 616. The synchronization module 904 may synchronize the data block usage information based on a RAID configuration 704 if the storage controller 616 is a RAID storage controller 702. Then, the method 1300 ends.
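
At this level the method reduces to two steps; a minimal sketch, assuming a hypothetical storage-manager API that returns per-block usage flags and a hypothetical controller deallocation call (neither interface is defined by this description):

def synchronize_block_usage(storage_manager, storage_controller):
    """Outline of method 1300: reference the manager's usage information,
    then synchronize the controller by deallocating the unused blocks."""
    usage = storage_manager.get_block_usage()      # hypothetical API: list of booleans, True = used
    unused = [lba for lba, used in enumerate(usage) if not used]
    storage_controller.deallocate(unused)          # hypothetical controller call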

FIG. 13B is a detailed schematic flow chart diagram illustrating another embodiment of a method 1350 for data block usage information synchronization for a non-volatile storage volume in accordance with the present invention. The description of the method 1350 refers to elements of FIGS. 6, 9, and 10, like numbers referring to like elements.

The method 1350 begins and the reference module 902 references 1352 data block usage information for data blocks of a non-volatile storage volume managed by a storage manager 608. Next, the update module 1006 updates 1354 the data block usage information based on storage operations that modify the data block usage information. These “in-flight” storage operations are those operations that are executed by the storage controller 616 subsequent to referencing the data block usage information and executed by the storage controller 616 prior to synchronizing the data block usage information.

The block determination module 1002 then determines 1356 one or more unused blocks from the data block usage information, which includes the data block usage information from the in-flight storage operations. The block determination module 1002, in certain embodiments, may determine the unused blocks as those that are freed blocks versus those that are free blocks. If the deallocation module 1004 is configured to directly perform deallocation 1358 on the blocks, the deallocation module 1004 deallocates 1360 used blocks identified by the storage controller 616 corresponding to unused blocks identified by the data block usage information.
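
One reading of the freed-versus-free distinction is that a free block was never allocated, while a freed block was allocated and later released and so may still have stale data on the media worth deallocating. A minimal sketch under that assumption (the set names are illustrative, not from this description):

def freed_blocks(ever_allocated: set, currently_used: set) -> set:
    """Freed blocks were allocated at some point but are no longer in use;
    free blocks (never allocated) are simply absent from ever_allocated."""
    return ever_allocated - currently_used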

Alternatively, the deallocation module 1004 sends 1362 a message directly to the storage controller 616 directly managing the non-volatile storage volume. The message indicates unused blocks identified by the data block usage information obtained from the storage manager 608 and updated by the update module 1006. The storage controller 616 then deallocates 1364 the identified unused blocks in response to the message and the method 1350 ends.

FIG. 14 is a schematic flow chart diagram illustrating an embodiment of a method 1400 for data management on non-volatile storage media managed by a storage manager 608 in accordance with the present invention. The description of the method 1400 refers to elements of FIGS. 6 and 11, like numbers referring to like elements.

The method 1400 begins and the reference module 1102 references 1402 a block map defining data block usage information for data blocks of non-volatile storage media managed by a storage manager 608. The block map 604 is maintained by the storage manager 608 and may be referenced through functionality provided by the storage manager 608. Next, the message module 1104 sends 1404 a message directly to a storage controller 616. The message includes unused block information indicating to the storage controller 616 the unused blocks identified by the data block usage information of the block map 604. The message module 1104 may send one or more messages to one or more RAID storage controllers 702 and/or sub-controllers 705 based on a RAID configuration. Then, the method 1400 ends. Depending on the type of message sent, the storage controller 616 may then determine whether to act on the unused block information in the message, comply with the message and act, and/or comply with the message by performing a secure erase of the data on the media for the unused block information.
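
How such a message might be fanned out across sub-controllers is not specified here; as a rough sketch only, assuming a simple striped layout and hypothetical controller objects with a deallocate_message() call (the striping arithmetic and the interface are assumptions, not part of this description):

def send_unused_block_messages(unused_lbas, sub_controllers, stripe_blocks):
    """Group unused logical blocks by the sub-controller owning their stripe,
    then send each sub-controller one message listing its unused blocks."""
    per_controller = {i: [] for i in range(len(sub_controllers))}
    for lba in unused_lbas:
        owner = (lba // stripe_blocks) % len(sub_controllers)
        per_controller[owner].append(lba)
    for i, controller in enumerate(sub_controllers):
        if per_controller[i]:
            controller.deallocate_message(per_controller[i])   # hypothetical send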

FIG. 15 is a detailed schematic flow chart diagram illustrating another embodiment of a method 1500 for data management on non-volatile storage media managed by a storage manager 608 in accordance with the present invention. The description of the method 1500 refers to elements of FIGS. 6, 11, and 12, like numbers referring to like elements.

The method 1500 begins and the reference module 1102 selects 1502 a set of logical blocks for analysis. For example, the reference module 1102 may select a set of logical blocks during a progressive scan of logical block addresses of a volume.

Then, the reference module 1102 references 1504 a block map 604 defining data block usage information for a set of data blocks of non-volatile storage media managed by a storage manager 608. The non-volatile storage media may be solid-state storage media 110 such as flash. The block map 604 is maintained by the storage manager 608 and may be referenced by calling a function of a storage API of the storage manager 608.

Next, the monitor module 1204 monitors 1506 storage operations on data blocks represented by the block map 604 to detect in-flight operations, or operations executed by the storage controller 616 subsequent to referencing the block map 604 and executed by the storage controller 616 prior to deallocating blocks for a storage volume. The record module 1206 then records 1508, in an in-flight block map 612, data block usage information for the monitored in-flight storage operations that change unused blocks to used blocks.

Next, the lock module 1212 obtains 1510 a lock on a logical-to-physical map or other address mapping index. In one embodiment, the lock module 1212 obtains the lock on the logical-to-physical map to keep other in-flight storage operations from simultaneously updating the logical-to-physical map and/or the combined block map 614. The map combination module 1208 then combines 1512 the block map 604 and the in-flight block map 612 into a combined block map 614 to update the one or more unused blocks of the data blocks. As a result, the data block usage information provided by the storage manager 608 as a snapshot is updated to account for operations executed before the storage controller 616 or deallocation module 1210 deallocates in accordance with the data block usage information.

The storage controller 616 deallocates 1514 the unused blocks identified by the combined block map 614. The storage controller 616 may deallocate the unused blocks in response to a message sent by the message module 1104 identifying the unused blocks. Alternatively, the deallocation module 1210 may directly deallocate 1514 used blocks on the storage controller 616 that correspond to the unused block information identified by the determination module 1202. The lock module 1212 releases 1516 the lock on the logical-to-physical mapping and the method 1500 ends. The method 1500 may be repeated for various sets of logical blocks during, for example, a progressive scan of logical block addresses in one or more volumes.
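
Put together, one pass of method 1500 might look like the following sketch, reusing the illustrative helpers and lock from the earlier sketches together with hypothetical manager and controller hooks (an ordering sketch only, not the patented method itself):

def method_1500_pass(storage_manager, storage_controller, lba_range,
                     l2p_map, leb_recoverable, block_size):
    """Reference a block map snapshot, record in-flight writes, then combine
    and deallocate under the logical-to-physical lock for one set of blocks."""
    block_map = storage_manager.get_block_map(lba_range)           # hypothetical API: bytearray snapshot
    in_flight_map = bytearray(len(block_map))
    storage_controller.record_in_flight(lba_range, in_flight_map)  # hypothetical hook that sets bits for new writes
    total_blocks = len(block_map) * 8
    with l2p_lock:                                                 # lock from the earlier sketch
        combined = combine_maps(block_map, in_flight_map)
        deallocate(unused_blocks(combined, total_blocks),
                   l2p_map, leb_recoverable, block_size)
    # lock released; repeat for the next set of blocks in the progressive scan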

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

What is claimed is:
1. An apparatus for managing data, the apparatus comprising: a write request receiver module that receives a storage request from a requesting device, the storage request comprising a request to store a data segment in a storage device, the data segment comprising a series of repeated, identical characters or a series of repeated, identical character strings; and a data segment token storage module that stores a data segment token in the storage device, the data segment token comprising a data segment identifier and a data segment length, the data segment token being substantially free of data from the data segment.
2. The apparatus of claim 1, wherein the storage request comprises a token directive to store the data segment token, the storage request being free from data of the data segment.
3. The apparatus of claim 2, wherein the data segment token storage module generates the data segment token prior to storing the token, wherein the data segment token storage module generates the data segment token from information in the token directive, the token directive being free from the data segment token.
4. The apparatus of claim 2, wherein the token directive comprises the data segment token and the data segment token storage module recognizes that the data segment token represents the data segment.
5. The apparatus of claim 1, wherein the storage request comprises data from the data segment and further comprising a token generation module that generates a data segment token from the data segment, the data segment token created in response to the storage request to store the data segment.
6. The apparatus of claim 5, wherein the token generation module resides at the requesting device.
7. The apparatus of claim 1, further comprising a secure erase module that overwrites existing data with characters such that the existing data is non-recoverable, the existing data comprising data of a data segment previously stored on the storage device identified with the same data segment identifier as the data segment identified in the storage request.
8. The apparatus of claim 7, wherein the secure erase module further comprises an erase confirmation module that transmits a message indicating that the existing data has been overwritten, the erase confirmation message transmitted in response to the secure erase module overwriting the existing data.
9. The apparatus of claim 7, wherein the secure erase module overwrites the existing data during a storage space recovery operation.
10. The apparatus of claim 7, wherein the storage request further comprises a request to overwrite the existing data and wherein the secure erase module overwrites the existing data in response to the request to overwrite the existing data.
11. The apparatus of claim 1, further comprising: a read request receiver module that receives a storage request to read the data segment; a read data segment token module that reads the data segment token corresponding to the data segment requested by the storage request; and a read request response module that transmits a response to the requesting device, the response generated using the data segment token corresponding to the requested data segment.
12. The apparatus of claim 11, wherein the read request response module further comprises a transmit data segment token module that transmits in the response a message to the requesting device, the message comprising at least the data segment identifier and the data segment length, the message being substantially free from data of the data segment.
13. The apparatus of claim 11, further comprising a reconstitute data segment module that reconstitutes data of the data segment using the data segment token, and wherein the read request response module further comprises a transmit data segment module that transmits the reconstituted requested data segment.
14. The apparatus of claim 1, wherein the series of repeated, identical characters or character strings indicate that the data segment is empty.
15. The apparatus of claim 1, wherein the storage request further comprises a request to reserve storage space on the storage device, the requested reserved storage space comprising an amount of storage space substantially similar to the data segment length, and further comprising a storage space reservation module that reserves an amount of storage space on the storage device consistent with the request to reserve storage space.
16. The apparatus of claim 1, wherein the empty data segment token comprises an entry in an index, the index corresponding to information and data stored on the storage device.
17. The apparatus of claim 1, wherein the data segment token comprises an object stored on the storage device.
18. The apparatus of claim 1, wherein the data segment token comprises metadata stored on the storage device.
19. The apparatus of claim 1, wherein the data segment token further comprises at least one of a data segment location indicator, at least one instance of the repeated, identical character, and at least one instance of the repeated, identical character string.